Descriptor Reference
A complete reference for all protein descriptors available in the protpy package, grouped by category.
Setup
pip install protpy
import protpy
# Load a sequence from a FASTA file
from Bio import SeqIO
with open("protein.fasta") as f:
protein_seq = str(next(SeqIO.parse(f, "fasta")).seq)
Composition Descriptors
Amino Acid Composition (AAComp)
Fraction of each of the 20 standard amino acids in the sequence.
result = protpy.amino_acid_composition(protein_seq)
# Shape: 1 x 20
# A C D E F ...
# 6.693 3.108 5.817 3.347 6.614 ...
Dipeptide Composition (DPComp)
Frequency of all 400 possible dipeptide (two-residue) combinations.
result = protpy.dipeptide_composition(protein_seq)
# Shape: 1 x 400
# AA AC AD AE AF ...
# 0.72 0.16 0.48 0.4 0.24 ...
Tripeptide Composition (TPComp)
Frequency of all 8000 possible tripeptide (three-residue) combinations.
result = protpy.tripeptide_composition(protein_seq)
# Shape: 1 x 8000
# AAA AAC AAD AAE AAF ...
# 1 0 0 2 0 ...
Grand Average of Hydropathy (GRAVY)
Mean of the Kyte-Doolittle hydropathy values across all residues. A positive value indicates overall hydrophobicity; negative indicates overall hydrophilicity.
result = protpy.gravy(protein_seq)
# Shape: 1 x 1
# GRAVY
# -0.045
Aromaticity
Fraction of aromatic residues (F, W, Y, H) in the sequence.
result = protpy.aromaticity(protein_seq)
# Shape: 1 x 1
# Aromaticity
# 0.118
Instability Index
Stability classifier based on dipeptide instability weight values (DIWV). Values below 40 indicate a stable protein; 40 or above indicates instability.
result = protpy.instability_index(protein_seq)
# Shape: 1 x 1
# InstabilityIndex
# 31.836
Isoelectric Point
Estimated pH at which the protein carries no net charge, calculated iteratively using standard pKa values for ionisable residues.
result = protpy.isoelectric_point(protein_seq)
# Shape: 1 x 1
# IsoelectricPoint
# 5.412
Molecular Weight
Average molecular weight of the protein calculated from residue masses, corrected for water lost at each peptide bond.
result = protpy.molecular_weight(protein_seq)
# Shape: 1 x 1
# MolecularWeight (Da)
# 139122.355
Charge Distribution
Positive, negative, and net charge contributions of ionisable residues at a given pH using the Henderson-Hasselbalch equation.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
float |
|
pH at which to calculate charge |
# Default pH 7.4
result = protpy.charge_distribution(protein_seq)
# Custom pH
result = protpy.charge_distribution(protein_seq, ph=6.0)
# Shape: 1 x 3
# PositiveCharge NegativeCharge NetCharge
# 99.526 114.956 -15.43
Hydrophobic/Polar/Charged Composition (HPC)
Percentage of residues belonging to each of three physicochemical groups: hydrophobic (A, C, F, I, L, M, V, W, Y), polar (G, N, Q, S, T), and charged (D, E, H, K, R).
result = protpy.hydrophobic_polar_charged_composition(protein_seq)
# Shape: 1 x 3
# Hydrophobic Polar Charged
# 44.542 32.669 18.247
Secondary Structure Propensity (SSP)
Average Chou-Fasman propensity values for alpha-helix, beta-sheet, and random coil conformations across all residues.
result = protpy.secondary_structure_propensity(protein_seq)
# Shape: 1 x 3
# Helix Sheet Coil
# 0.983 1.05 1.043
k-mer Composition
Frequency of all possible k-length residue subsequences, expressed as a percentage of total k-mers.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Length of each k-mer |
# Default k=2 (dipeptide frequencies)
result = protpy.kmer_composition(protein_seq)
# Custom k
result = protpy.kmer_composition(protein_seq, k=3)
# Shape: 1 x 20^k (e.g. 1 x 400 for k=2)
# AA AC AD ...
# 0.797 0.159 ... ...
Reduced Alphabet Composition
Amino acid composition after mapping residues to a reduced alphabet of
physicochemical groups. Supported alphabet sizes: 2, 3, 4, 6.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Number of reduced groups |
# Default alphabet_size=6
result = protpy.reduced_alphabet_composition(protein_seq)
# Custom size
result = protpy.reduced_alphabet_composition(protein_seq, alphabet_size=4)
# Shape: 1 x alphabet_size
# Group_1 Group_2 Group_3 Group_4 Group_5 Group_6
# 25.339 34.741 9.163 9.084 10.837 10.837
Motif Composition
Count of occurrences (including overlapping) of biological sequence motifs matched via regular expressions. Eight built-in motifs are used by default; a custom list can be supplied.
Default motifs:
Column |
Pattern |
Biological meaning |
|---|---|---|
|
|
N-linked glycosylation site |
|
|
Integrin-binding RGD motif |
|
|
ER retention signal |
|
|
Zinc-finger CxxC motif |
|
|
CAAX prenylation box |
|
|
cAMP/PKA phosphorylation site |
|
|
Dileucine lysosomal sorting signal |
|
|
PEST degradation signal |
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
list or None |
|
Custom regex patterns; uses built-in 8 if |
# Default built-in motifs
result = protpy.motif_composition(protein_seq)
# Custom motif list
result = protpy.motif_composition(protein_seq, motifs=[r'RGD', r'N[^P][ST]'])
# Shape: 1 x len(motifs)
# NxST_glycosylation RGD_integrin KDEL_retention CxxC_zinc_finger ...
# 23 0 0 2 ...
Amino Acid Pair Composition
Frequency of all 400 residue-pair combinations with column names annotated by the physicochemical class of each residue (Hydrophobic, Polar, Charged, or Other).
result = protpy.amino_acid_pair_composition(protein_seq)
# Shape: 1 x 400
# AA_Hydrophobic-Hydrophobic AA_Hydrophobic-Polar AA_Hydrophobic-Charged ...
# 0.797 0.159 ... ...
Aliphatic Index
A measure of the relative volume occupied by aliphatic side chains (Ala, Val, Ile, Leu). Higher values indicate greater thermostability. Formula: AI = Ala% + 2.9×Val% + 3.9×(Ile%+Leu%).
result = protpy.aliphatic_index(protein_seq)
# Shape: 1 x 1
# AliphaticIndex
# 82.725
Extinction Coefficient
The molar extinction coefficient at 280 nm, calculated from the number of Trp (W), Tyr (Y), and Cys (C) residues. Reported for both reduced (no disulfide bonds) and oxidized (all Cys paired) states.
result = protpy.extinction_coefficient(protein_seq)
# Shape: 1 x 2
# ExtCoeff_Reduced ExtCoeff_Oxidized
# 140960 143335
Boman Index
Sum of solubility values for amino acids divided by sequence length, predicting potential for protein–protein interactions. Positive values suggest membrane-binding or interaction potential.
result = protpy.boman_index(protein_seq)
# Shape: 1 x 1
# BomanIndex
# 0.119
Aggregation Propensity
Estimates aggregation-prone regions via a sliding-window approach combining Kyte–Doolittle hydrophobicity and charge neutrality. Returns the count of qualifying windows and the fraction of the sequence covered.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Sliding window size |
|
float |
|
Minimum mean hydrophobicity |
|
int |
|
Maximum charged residues per window |
result = protpy.aggregation_propensity(protein_seq)
# Shape: 1 x 2
# AggregProneRegions AggregProneFraction
# 58 11.793
Hydrophobic Moment
The mean and maximum hydrophobic moment across sliding windows, using the Eisenberg hydrophobicity scale and a helical-wheel projection. Captures amphipathicity of putative helix segments.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Sliding window size |
|
int |
|
Residue rotation angle in degrees (100° for α-helix) |
result = protpy.hydrophobic_moment(protein_seq)
# Shape: 1 x 2
# HydrophobicMoment_Mean HydrophobicMoment_Max
# 0.272 0.813
Shannon Entropy
Information-theoretic measure of amino acid diversity in a sequence. Computed as:
where \(p_i\) is the fractional frequency of each amino acid type present. A value of 0 indicates a completely repetitive (single amino acid) sequence; the theoretical maximum of \(\log_2(20) \approx 4.322\) bits corresponds to a perfectly uniform distribution across all 20 canonical amino acids. Widely used as a low-complexity filter and diversity measure in ML feature pipelines.
result = protpy.shannon_entropy(protein_seq)
# Shape: 1 x 1
# ShannonEntropy
# 4.163
Pseudo Amino Acid Composition (PAAComp)
Augmented amino acid composition that incorporates sequence-order effects via correlation factors derived from physicochemical properties. Reduces the dimensionality problem of pure sequence-order information while retaining more sequence information than simple AAComp.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Number of sequence-order correlation factors to include |
|
float |
|
Weighting factor for correlation layers |
|
list |
|
AAIndex accession numbers to use (uses built-in set if empty) |
# Default parameters
result = protpy.pseudo_amino_acid_composition(protein_seq)
# Custom parameters
result = protpy.pseudo_amino_acid_composition(protein_seq, lamda=10, weight=0.1)
# Shape: 1 x (20 + lamda) → 1 x 50 with defaults
# PAAC_1 PAAC_2 PAAC_3 ...
# 0.127 0.059 0.111 ...
Amphiphilic Pseudo Amino Acid Composition (APAAComp)
Extension of PAAComp that uses both hydrophobicity and hydrophilicity properties to capture amphiphilic patterns (dual hydrophobic/hydrophilic character) along the sequence.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Number of sequence-order correlation factors |
|
float |
|
Weighting factor for correlation layers |
|
list |
|
AAIndex accession numbers (defaults to hydrophobicity + hydrophilicity) |
# Default parameters
result = protpy.amphiphilic_pseudo_amino_acid_composition(protein_seq)
# Custom parameters
result = protpy.amphiphilic_pseudo_amino_acid_composition(protein_seq, lamda=15, weight=0.3)
# Shape: 1 x (20 + 2*lamda) → 1 x 80 with defaults
# APAAC_1 APAAC_2 APAAC_3 ...
# 6.624 3.076 5.757 ...
Autocorrelation Descriptors
Autocorrelation descriptors measure the correlation between physicochemical property
values of residues separated by a lag distance along the sequence. By default, 8
AAIndex properties are used, generating lag × 8 = 240 features.
Default properties:
AAIndex ID |
Property |
|---|---|
|
Normalised average hydrophobicity |
|
Average flexibility indices |
|
Polarizability parameter |
|
Free energy of solution in water |
|
Residue accessible surface area in tripeptide |
|
Residue volume |
|
Steric parameter |
|
Relative mutability |
Moreaubroto Autocorrelation (MBAuto)
Uses raw property values as the basis for correlation measurement.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Maximum lag distance |
|
list |
(8 defaults above) |
AAIndex accession numbers |
|
bool |
|
Normalise output values |
# Default parameters
result = protpy.moreaubroto_autocorrelation(protein_seq)
# Custom parameters
result = protpy.moreaubroto_autocorrelation(protein_seq, lag=15, properties=["CIDH920105"])
# Shape: 1 x (lag × len(properties)) → 1 x 240 with defaults
# MBAuto_CIDH920105_1 MBAuto_CIDH920105_2 ...
# -0.052 -0.104 ...
Moran Autocorrelation (MAuto)
Uses the deviation from the mean property value, making it mean-centred and thereby less sensitive to the absolute property scale.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Maximum lag distance |
|
list |
(8 defaults above) |
AAIndex accession numbers |
|
bool |
|
Normalise output values |
# Default parameters
result = protpy.moran_autocorrelation(protein_seq)
# Custom parameters
result = protpy.moran_autocorrelation(protein_seq, lag=15)
# Shape: 1 x (lag × len(properties)) → 1 x 240 with defaults
# MAuto_CIDH920105_1 MAuto_CIDH920105_2 ...
# -0.07786 -0.07879 ...
Geary Autocorrelation (GAuto)
Uses squared differences between property values at each lag, making it sensitive to local dissimilarities rather than global correlation.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Maximum lag distance |
|
list |
(8 defaults above) |
AAIndex accession numbers |
|
bool |
|
Normalise output values |
# Default parameters
result = protpy.geary_autocorrelation(protein_seq)
# Custom parameters
result = protpy.geary_autocorrelation(protein_seq, lag=10, normalize=False)
# Shape: 1 x (lag × len(properties)) → 1 x 240 with defaults
# GAuto_CIDH920105_1 GAuto_CIDH920105_2 ...
# 1.057 1.077 ...
Conjoint Triad Descriptor
Conjoint Triad (CTriad)
Encodes the sequence using a 7-class reduced amino acid alphabet and computes the frequency of all consecutive three-residue combinations (triads). The 7 classes are: (1) AGV, (2) ILFP, (3) YMTS, (4) HNQW, (5) RK, (6) DE, (7) C.
result = protpy.conjoint_triad(protein_seq)
# Shape: 1 x 343 (7 × 7 × 7 class combinations)
# 111 112 113 114 ...
# 7 17 11 3 ...
CTD Descriptors
CTD (Composition, Transition, Distribution) descriptors characterise the distribution of residues belonging to three physicochemical classes along the sequence. Seven physicochemical properties are supported.
Supported properties:
Property key |
Description |
Classes |
|---|---|---|
|
Hydrophobicity |
Polar / Neutral / Hydrophobic |
|
Normalised van der Waals volume |
0–2.78 / 2.95–4.0 / 4.03–8.08 |
|
Polarity |
4.9–6.2 / 8.0–9.2 / 10.4–13.0 |
|
Charge |
Positive / Neutral / Negative |
|
Secondary structure |
Helix / Strand / Coil |
|
Solvent accessibility |
Buried / Exposed / Intermediate |
|
Polarizability |
0–0.108 / 0.128–0.186 / 0.219–0.409 |
CTD Composition
Fraction of residues in each of the three physicochemical classes.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
str |
|
Physicochemical property to use |
result = protpy.ctd_composition(protein_seq)
result = protpy.ctd_composition(protein_seq, property="charge")
# hydrophobicity_CTD_C_01 hydrophobicity_CTD_C_02 hydrophobicity_CTD_C_03
# 0.279 0.386 0.335
CTD Transition
Fraction of transitions between each pair of the three physicochemical classes.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
str |
|
Physicochemical property to use |
result = protpy.ctd_transition(protein_seq)
result = protpy.ctd_transition(protein_seq, property="polarity")
# hydrophobicity_CTD_T_12 hydrophobicity_CTD_T_13 hydrophobicity_CTD_T_23
# 0.181 0.161 0.179
CTD Distribution
Position of the first, 25%, 50%, 75%, and last residue of each class within the sequence (as a percentage of total length).
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
str |
|
Physicochemical property to use |
result = protpy.ctd_distribution(protein_seq)
result = protpy.ctd_distribution(protein_seq, property="secondary_struct")
# hydrophobicity_CTD_D_01_01 hydrophobicity_CTD_D_02_01 ...
# 0.0796 0.557 ...
CTD Combined (ctd_)
Calculate Composition, Transition and Distribution for one or all supported properties in a single call.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
str |
|
Property to use when |
|
bool |
|
If |
# All 7 properties (default)
result = protpy.ctd_(protein_seq)
# Single property
result = protpy.ctd_(protein_seq, property="charge", all_ctd=False)
# Shape: 1 x (3 + 3 + 15) per property → 1 x 147 for all 7 properties
# hydrophobicity_CTD_C_01 hydrophobicity_CTD_C_02 ...
# 0.279 0.386 ...
Sequence Order Descriptors
Sequence order descriptors capture the effect of residue spacing along the sequence using physicochemical distance matrices. Two distance matrices are supported:
Matrix |
Description |
|---|---|
|
Physicochemical distance based on Schneider-Wrede scale (default) |
|
Physicochemical distance based on Grantham’s amino acid difference formula |
Sequence Order Coupling Number — single (sequence_order_coupling_number_)
Computes the sum of squared physicochemical distances between all residue pairs
separated by a gap of d. Returns a single float.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Gap between residue pairs |
|
str |
|
Distance matrix to use |
result = protpy.sequence_order_coupling_number_(protein_seq)
result = protpy.sequence_order_coupling_number_(protein_seq, d=5, distance_matrix="grantham")
# Returns: 401.387 (float)
Sequence Order Coupling Number — series (sequence_order_coupling_number)
Calculates SOCN values across all gaps from 1 to lag.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Maximum gap value |
|
str |
|
Distance matrix to use |
# Default parameters
result = protpy.sequence_order_coupling_number(protein_seq)
# Custom lag and matrix
result = protpy.sequence_order_coupling_number(protein_seq, lag=10, distance_matrix="grantham")
# Shape: 1 x lag → 1 x 30 with defaults
# SOCN_SW1 SOCN_SW2 SOCN_SW3 ...
# 401.387 409.243 376.946 ...
Sequence Order Coupling Number — all matrices (sequence_order_coupling_number_all)
Calculates SOCN across all lags using both the Schneider-Wrede and Grantham matrices and concatenates the results.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Maximum gap value |
result = protpy.sequence_order_coupling_number_all(protein_seq)
result = protpy.sequence_order_coupling_number_all(protein_seq, lag=15)
# Shape: 1 x (2 × lag) → 1 x 60 with defaults
# SOCN_SW1 ... SOCN_Grant1 ...
Quasi Sequence Order (quasi_sequence_order)
Extends SOCN by combining standard amino acid composition with sequence-order
coupling numbers, weighted by a factor weight. Captures both residue type and
spatial distribution information.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Maximum lag value |
|
float |
|
Weighting factor for coupling terms |
|
str |
|
Distance matrix to use |
# Default parameters
result = protpy.quasi_sequence_order(protein_seq)
# Custom parameters
result = protpy.quasi_sequence_order(protein_seq, lag=10, weight=0.2, distance_matrix="grantham")
# Shape: 1 x (20 + lag) → 1 x 50 with defaults
# QSO_SW1 QSO_SW2 QSO_SW3 ...
# 0.005692 0.002643 0.004947 ...
Quasi Sequence Order — all matrices (quasi_sequence_order_all)
Calculates Quasi Sequence Order using both distance matrices and concatenates the results.
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
int |
|
Maximum lag value |
|
float |
|
Weighting factor for coupling terms |
result = protpy.quasi_sequence_order_all(protein_seq)
result = protpy.quasi_sequence_order_all(protein_seq, lag=15, weight=0.05)
# Shape: 1 x (2 × (20 + lag)) → 1 x 100 with defaults
# QSO_SW1 ... QSO_Grant1 ...
Descriptor Summary
Speed ratings reflect typical computation time for a single average-length protein (~500 residues) on a standard CPU:
Rating |
Meaning |
|---|---|
✅ Fast |
< 1 ms — simple residue counting or scalar formula |
🟡 Moderate |
1–50 ms — sliding window or O(n²) pass |
🔴 Slow |
> 50 ms — large feature space, iterative convergence, or many property lookups |
Note
Autocorrelation, PseAAC, and APAAComp scale with both sequence length and
lag — reduce lag or the number of properties to speed them up. Tripeptide
composition always produces 8000 columns regardless of sequence length.
Descriptor |
Function |
Output shape |
Category |
Speed |
Complexity |
|---|---|---|---|---|---|
Amino Acid Composition |
|
1 × 20 |
Composition |
✅ Fast |
O(n) |
Dipeptide Composition |
|
1 × 400 |
Composition |
✅ Fast |
O(n) |
Tripeptide Composition |
|
1 × 8000 |
Composition |
🟡 Moderate |
O(n) |
GRAVY |
|
1 × 1 |
Composition |
✅ Fast |
O(n) |
Aromaticity |
|
1 × 1 |
Composition |
✅ Fast |
O(n) |
Instability Index |
|
1 × 1 |
Composition |
✅ Fast |
O(n) |
Isoelectric Point |
|
1 × 1 |
Composition |
🟡 Moderate |
O(n · iter) |
Molecular Weight |
|
1 × 1 |
Composition |
✅ Fast |
O(n) |
Charge Distribution |
|
1 × 3 |
Composition |
✅ Fast |
O(n) |
Hydrophobic/Polar/Charged |
|
1 × 3 |
Composition |
✅ Fast |
O(n) |
Secondary Structure Propensity |
|
1 × 3 |
Composition |
✅ Fast |
O(n) |
k-mer Composition |
|
1 × 20^k |
Composition |
🟡 Moderate |
O(n · 20^k) |
Reduced Alphabet Composition |
|
1 × alphabet_size |
Composition |
✅ Fast |
O(n) |
Motif Composition |
|
1 × len(motifs) |
Composition |
🟡 Moderate |
O(n · m) |
Amino Acid Pair Composition |
|
1 × 400 |
Composition |
✅ Fast |
O(n) |
Aliphatic Index |
|
1 × 1 |
Composition |
✅ Fast |
O(n) |
Extinction Coefficient |
|
1 × 2 |
Composition |
✅ Fast |
O(n) |
Boman Index |
|
1 × 1 |
Composition |
✅ Fast |
O(n) |
Aggregation Propensity |
|
1 × 2 |
Composition |
🟡 Moderate |
O(n · win) |
Hydrophobic Moment |
|
1 × 2 |
Composition |
🟡 Moderate |
O(n · win) |
Shannon Entropy |
|
1 × 1 |
Composition |
✅ Fast |
O(n) |
Pseudo AAComp |
|
1 × (20 + lamda) |
Composition |
🔴 Slow |
O(n · lamda · props) |
Amphiphilic Pseudo AAComp |
|
1 × (20 + 2×lamda) |
Composition |
🔴 Slow |
O(n · lamda · props) |
Moreaubroto Autocorrelation |
|
1 × (lag × props) |
Autocorrelation |
🔴 Slow |
O(n · lag · props) |
Moran Autocorrelation |
|
1 × (lag × props) |
Autocorrelation |
🔴 Slow |
O(n · lag · props) |
Geary Autocorrelation |
|
1 × (lag × props) |
Autocorrelation |
🔴 Slow |
O(n · lag · props) |
Conjoint Triad |
|
1 × 343 |
Conjoint Triad |
✅ Fast |
O(n) |
CTD Composition |
|
1 × 3 |
CTD |
✅ Fast |
O(n) |
CTD Transition |
|
1 × 3 |
CTD |
✅ Fast |
O(n) |
CTD Distribution |
|
1 × 15 |
CTD |
✅ Fast |
O(n) |
CTD Combined |
|
1 × 147 |
CTD |
🟡 Moderate |
O(n · props) |
SOCN (single) |
|
float |
Sequence Order |
✅ Fast |
O(n) |
SOCN (series) |
|
1 × lag |
Sequence Order |
🟡 Moderate |
O(n · lag) |
SOCN (all matrices) |
|
1 × (2 × lag) |
Sequence Order |
🟡 Moderate |
O(n · lag) |
Quasi Sequence Order |
|
1 × (20 + lag) |
Sequence Order |
🟡 Moderate |
O(n · lag) |
Quasi Sequence Order (all) |
|
1 × (2 × (20 + lag)) |
Sequence Order |
🟡 Moderate |
O(n · lag) |
References
The descriptors implemented in protpy are based on the following published methods.
Composition
Amino acid, dipeptide, and tripeptide composition: Nakashima, H., Nishikawa, K., & Ooi, T. (1986). The folding type of a protein is relevant to the amino acid composition. Journal of Biochemistry, 99(1), 153–162.
GRAVY: Kyte, J., & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157(1), 105–132.
Aromaticity: Lobry, J. R., & Gautier, C. (1994). Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Research, 22(15), 3174–3180.
Instability index: Guruprasad, K., Reddy, B. V. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition. Protein Engineering, 4(2), 155–161.
Isoelectric point: Bjellqvist, B., et al. (1994). The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis, 14(1), 1023–1031.
Molecular weight, isoelectric point, charge: Gasteiger, E., et al. (2005). Protein identification and analysis tools on the ExPASy server. In The Proteomics Protocols Handbook, Humana Press, 571–607.
Secondary structure propensity: Chou, P. Y., & Fasman, G. D. (1974). Prediction of protein conformation. Biochemistry, 13(2), 222–245.
Aliphatic index: Ikai, A. J. (1980). Thermostability and aliphatic index of globular proteins. Journal of Biochemistry, 88(6), 1895–1898.
Extinction coefficient: Pace, C. N., et al. (1995). How to measure and predict the molar absorption coefficient of a protein. Protein Science, 4(11), 2411–2423.
Boman index: Boman, H. G. (2003). Antibacterial peptides: basic facts and emerging concepts. Journal of Internal Medicine, 254(3), 197–215.
Hydrophobic moment: Eisenberg, D., Weiss, R. M., & Terwilliger, T. C. (1982). The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature, 299, 371–374.
Shannon entropy: Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
Pseudo amino acid composition (PseAAC): Chou, K.-C. (2001). Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function, and Bioinformatics, 43(3), 246–255.
Amphiphilic PseAAC (APseAAC): Chou, K.-C. (2005). Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21(1), 10–19.
Autocorrelation
Moreau-Broto autocorrelation: Moreau, G., & Broto, P. (1980). The autocorrelation of a topological structure: A new molecular descriptor. Nouveau Journal de Chimie, 4, 359–360.
Moran autocorrelation: Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1–2), 17–23.
Geary autocorrelation: Geary, R. C. (1954). The contiguity ratio and statistical mapping. The Incorporated Statistician, 5(3), 115–145.
AAIndex properties: Kawashima, S., & Kanehisa, M. (2000). AAindex: amino acid index database. Nucleic Acids Research, 28(1), 374.
Conjoint Triad
Liu, B., et al. (2008). Prediction of protein-protein interactions based on the naive Bayes classifier with amino acid composition features. Biochemical and Biophysical Research Communications, 368(2), 462–468.
CTD
Dubchak, I., et al. (1995). Prediction of protein folding class using global description of amino acid sequence. PNAS, 92(19), 8700–8704.
Dubchak, I., et al. (1999). Recognition of a protein fold in the context of the SCOP classification. Proteins, 35(4), 401–407.
Sequence Order
Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science, 185(4154), 862–864.
Schneider, G., & Wrede, P. (1994). The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution. Biophysical Journal, 66(2), 335–344.
Chou, K.-C. (2000). Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochemical and Biophysical Research Communications, 278(2), 477–483.