Descriptor Reference

A complete reference for all protein descriptors available in the protpy package, grouped by category.


Setup

pip install protpy
import protpy

# Load a sequence from a FASTA file
from Bio import SeqIO
with open("protein.fasta") as f:
    protein_seq = str(next(SeqIO.parse(f, "fasta")).seq)

Composition Descriptors

Amino Acid Composition (AAComp)

Fraction of each of the 20 standard amino acids in the sequence.

result = protpy.amino_acid_composition(protein_seq)
# Shape: 1 x 20
# A      C      D      E      F  ...
# 6.693  3.108  5.817  3.347  6.614 ...

Dipeptide Composition (DPComp)

Frequency of all 400 possible dipeptide (two-residue) combinations.

result = protpy.dipeptide_composition(protein_seq)
# Shape: 1 x 400
# AA    AC    AD    AE    AF  ...
# 0.72  0.16  0.48  0.4   0.24 ...

Tripeptide Composition (TPComp)

Frequency of all 8000 possible tripeptide (three-residue) combinations.

result = protpy.tripeptide_composition(protein_seq)
# Shape: 1 x 8000
# AAA  AAC  AAD  AAE  AAF ...
# 1    0    0    2    0 ...

Grand Average of Hydropathy (GRAVY)

Mean of the Kyte-Doolittle hydropathy values across all residues. A positive value indicates overall hydrophobicity; negative indicates overall hydrophilicity.

result = protpy.gravy(protein_seq)
# Shape: 1 x 1
# GRAVY
# -0.045

Aromaticity

Fraction of aromatic residues (F, W, Y, H) in the sequence.

result = protpy.aromaticity(protein_seq)
# Shape: 1 x 1
# Aromaticity
# 0.118

Instability Index

Stability classifier based on dipeptide instability weight values (DIWV). Values below 40 indicate a stable protein; 40 or above indicates instability.

result = protpy.instability_index(protein_seq)
# Shape: 1 x 1
# InstabilityIndex
# 31.836

Isoelectric Point

Estimated pH at which the protein carries no net charge, calculated iteratively using standard pKa values for ionisable residues.

result = protpy.isoelectric_point(protein_seq)
# Shape: 1 x 1
# IsoelectricPoint
# 5.412

Molecular Weight

Average molecular weight of the protein calculated from residue masses, corrected for water lost at each peptide bond.

result = protpy.molecular_weight(protein_seq)
# Shape: 1 x 1
# MolecularWeight (Da)
# 139122.355

Charge Distribution

Positive, negative, and net charge contributions of ionisable residues at a given pH using the Henderson-Hasselbalch equation.

Parameter

Type

Default

Description

ph

float

7.4

pH at which to calculate charge

# Default pH 7.4
result = protpy.charge_distribution(protein_seq)

# Custom pH
result = protpy.charge_distribution(protein_seq, ph=6.0)

# Shape: 1 x 3
# PositiveCharge  NegativeCharge  NetCharge
# 99.526          114.956         -15.43

Hydrophobic/Polar/Charged Composition (HPC)

Percentage of residues belonging to each of three physicochemical groups: hydrophobic (A, C, F, I, L, M, V, W, Y), polar (G, N, Q, S, T), and charged (D, E, H, K, R).

result = protpy.hydrophobic_polar_charged_composition(protein_seq)
# Shape: 1 x 3
# Hydrophobic  Polar   Charged
# 44.542       32.669  18.247

Secondary Structure Propensity (SSP)

Average Chou-Fasman propensity values for alpha-helix, beta-sheet, and random coil conformations across all residues.

result = protpy.secondary_structure_propensity(protein_seq)
# Shape: 1 x 3
# Helix  Sheet  Coil
# 0.983  1.05   1.043

k-mer Composition

Frequency of all possible k-length residue subsequences, expressed as a percentage of total k-mers.

Parameter

Type

Default

Description

k

int

2

Length of each k-mer

# Default k=2 (dipeptide frequencies)
result = protpy.kmer_composition(protein_seq)

# Custom k
result = protpy.kmer_composition(protein_seq, k=3)

# Shape: 1 x 20^k  (e.g. 1 x 400 for k=2)
# AA     AC     AD  ...
# 0.797  0.159  ... ...

Reduced Alphabet Composition

Amino acid composition after mapping residues to a reduced alphabet of physicochemical groups. Supported alphabet sizes: 2, 3, 4, 6.

Parameter

Type

Default

Description

alphabet_size

int

6

Number of reduced groups

# Default alphabet_size=6
result = protpy.reduced_alphabet_composition(protein_seq)

# Custom size
result = protpy.reduced_alphabet_composition(protein_seq, alphabet_size=4)

# Shape: 1 x alphabet_size
# Group_1  Group_2  Group_3  Group_4  Group_5  Group_6
# 25.339   34.741   9.163    9.084    10.837   10.837

Motif Composition

Count of occurrences (including overlapping) of biological sequence motifs matched via regular expressions. Eight built-in motifs are used by default; a custom list can be supplied.

Default motifs:

Column

Pattern

Biological meaning

NxST_glycosylation

N[^P][ST]

N-linked glycosylation site

RGD_integrin

RGD

Integrin-binding RGD motif

KDEL_retention

KDEL

ER retention signal

CxxC_zinc_finger

C..C

Zinc-finger CxxC motif

CAAX_prenylation

C[A-Z]{2}[CSIM]$

CAAX prenylation box

cAMP_PKA

[RK]{2}.[ST]

cAMP/PKA phosphorylation site

dileucine_sorting

[DE]xxxL[LI]

Dileucine lysosomal sorting signal

PEST_region

P.{1,10}[ED]

PEST degradation signal

Parameter

Type

Default

Description

motifs

list or None

None

Custom regex patterns; uses built-in 8 if None

# Default built-in motifs
result = protpy.motif_composition(protein_seq)

# Custom motif list
result = protpy.motif_composition(protein_seq, motifs=[r'RGD', r'N[^P][ST]'])

# Shape: 1 x len(motifs)
# NxST_glycosylation  RGD_integrin  KDEL_retention  CxxC_zinc_finger  ...
# 23                  0             0               2                 ...

Amino Acid Pair Composition

Frequency of all 400 residue-pair combinations with column names annotated by the physicochemical class of each residue (Hydrophobic, Polar, Charged, or Other).

result = protpy.amino_acid_pair_composition(protein_seq)
# Shape: 1 x 400
# AA_Hydrophobic-Hydrophobic  AA_Hydrophobic-Polar  AA_Hydrophobic-Charged  ...
# 0.797                       0.159                 ...                     ...

Aliphatic Index

A measure of the relative volume occupied by aliphatic side chains (Ala, Val, Ile, Leu). Higher values indicate greater thermostability. Formula: AI = Ala% + 2.9×Val% + 3.9×(Ile%+Leu%).

result = protpy.aliphatic_index(protein_seq)
# Shape: 1 x 1
# AliphaticIndex
# 82.725

Extinction Coefficient

The molar extinction coefficient at 280 nm, calculated from the number of Trp (W), Tyr (Y), and Cys (C) residues. Reported for both reduced (no disulfide bonds) and oxidized (all Cys paired) states.

result = protpy.extinction_coefficient(protein_seq)
# Shape: 1 x 2
# ExtCoeff_Reduced  ExtCoeff_Oxidized
# 140960            143335

Boman Index

Sum of solubility values for amino acids divided by sequence length, predicting potential for protein–protein interactions. Positive values suggest membrane-binding or interaction potential.

result = protpy.boman_index(protein_seq)
# Shape: 1 x 1
# BomanIndex
# 0.119

Aggregation Propensity

Estimates aggregation-prone regions via a sliding-window approach combining Kyte–Doolittle hydrophobicity and charge neutrality. Returns the count of qualifying windows and the fraction of the sequence covered.

Parameter

Type

Default

Description

window

int

5

Sliding window size

hydrophobicity_threshold

float

2.0

Minimum mean hydrophobicity

charge_threshold

int

1

Maximum charged residues per window

result = protpy.aggregation_propensity(protein_seq)
# Shape: 1 x 2
# AggregProneRegions  AggregProneFraction
# 58                  11.793

Hydrophobic Moment

The mean and maximum hydrophobic moment across sliding windows, using the Eisenberg hydrophobicity scale and a helical-wheel projection. Captures amphipathicity of putative helix segments.

Parameter

Type

Default

Description

window

int

11

Sliding window size

angle

int

100

Residue rotation angle in degrees (100° for α-helix)

result = protpy.hydrophobic_moment(protein_seq)
# Shape: 1 x 2
# HydrophobicMoment_Mean  HydrophobicMoment_Max
# 0.272                   0.813

Shannon Entropy

Information-theoretic measure of amino acid diversity in a sequence. Computed as:

\[H = -\sum_i p_i \log_2 p_i\]

where \(p_i\) is the fractional frequency of each amino acid type present. A value of 0 indicates a completely repetitive (single amino acid) sequence; the theoretical maximum of \(\log_2(20) \approx 4.322\) bits corresponds to a perfectly uniform distribution across all 20 canonical amino acids. Widely used as a low-complexity filter and diversity measure in ML feature pipelines.

result = protpy.shannon_entropy(protein_seq)
# Shape: 1 x 1
# ShannonEntropy
# 4.163

Pseudo Amino Acid Composition (PAAComp)

Augmented amino acid composition that incorporates sequence-order effects via correlation factors derived from physicochemical properties. Reduces the dimensionality problem of pure sequence-order information while retaining more sequence information than simple AAComp.

Parameter

Type

Default

Description

lamda

int

30

Number of sequence-order correlation factors to include

weight

float

0.05

Weighting factor for correlation layers

properties

list

[]

AAIndex accession numbers to use (uses built-in set if empty)

# Default parameters
result = protpy.pseudo_amino_acid_composition(protein_seq)

# Custom parameters
result = protpy.pseudo_amino_acid_composition(protein_seq, lamda=10, weight=0.1)

# Shape: 1 x (20 + lamda)  →  1 x 50 with defaults
# PAAC_1  PAAC_2  PAAC_3  ...
# 0.127   0.059   0.111   ...

Amphiphilic Pseudo Amino Acid Composition (APAAComp)

Extension of PAAComp that uses both hydrophobicity and hydrophilicity properties to capture amphiphilic patterns (dual hydrophobic/hydrophilic character) along the sequence.

Parameter

Type

Default

Description

lamda

int

30

Number of sequence-order correlation factors

weight

float

0.5

Weighting factor for correlation layers

properties

list

[]

AAIndex accession numbers (defaults to hydrophobicity + hydrophilicity)

# Default parameters
result = protpy.amphiphilic_pseudo_amino_acid_composition(protein_seq)

# Custom parameters
result = protpy.amphiphilic_pseudo_amino_acid_composition(protein_seq, lamda=15, weight=0.3)

# Shape: 1 x (20 + 2*lamda)  →  1 x 80 with defaults
# APAAC_1  APAAC_2  APAAC_3  ...
# 6.624    3.076    5.757    ...

Autocorrelation Descriptors

Autocorrelation descriptors measure the correlation between physicochemical property values of residues separated by a lag distance along the sequence. By default, 8 AAIndex properties are used, generating lag × 8 = 240 features.

Default properties:

AAIndex ID

Property

CIDH920105

Normalised average hydrophobicity

BHAR880101

Average flexibility indices

CHAM820101

Polarizability parameter

CHAM820102

Free energy of solution in water

CHOC760101

Residue accessible surface area in tripeptide

BIGC670101

Residue volume

CHAM810101

Steric parameter

DAYM780201

Relative mutability


Moreaubroto Autocorrelation (MBAuto)

Uses raw property values as the basis for correlation measurement.

Parameter

Type

Default

Description

lag

int

30

Maximum lag distance

properties

list

(8 defaults above)

AAIndex accession numbers

normalize

bool

True

Normalise output values

# Default parameters
result = protpy.moreaubroto_autocorrelation(protein_seq)

# Custom parameters
result = protpy.moreaubroto_autocorrelation(protein_seq, lag=15, properties=["CIDH920105"])

# Shape: 1 x (lag × len(properties))  →  1 x 240 with defaults
# MBAuto_CIDH920105_1  MBAuto_CIDH920105_2  ...
# -0.052               -0.104               ...

Moran Autocorrelation (MAuto)

Uses the deviation from the mean property value, making it mean-centred and thereby less sensitive to the absolute property scale.

Parameter

Type

Default

Description

lag

int

30

Maximum lag distance

properties

list

(8 defaults above)

AAIndex accession numbers

normalize

bool

True

Normalise output values

# Default parameters
result = protpy.moran_autocorrelation(protein_seq)

# Custom parameters
result = protpy.moran_autocorrelation(protein_seq, lag=15)

# Shape: 1 x (lag × len(properties))  →  1 x 240 with defaults
# MAuto_CIDH920105_1  MAuto_CIDH920105_2  ...
# -0.07786            -0.07879            ...

Geary Autocorrelation (GAuto)

Uses squared differences between property values at each lag, making it sensitive to local dissimilarities rather than global correlation.

Parameter

Type

Default

Description

lag

int

30

Maximum lag distance

properties

list

(8 defaults above)

AAIndex accession numbers

normalize

bool

True

Normalise output values

# Default parameters
result = protpy.geary_autocorrelation(protein_seq)

# Custom parameters
result = protpy.geary_autocorrelation(protein_seq, lag=10, normalize=False)

# Shape: 1 x (lag × len(properties))  →  1 x 240 with defaults
# GAuto_CIDH920105_1  GAuto_CIDH920105_2  ...
# 1.057               1.077               ...

Conjoint Triad Descriptor

Conjoint Triad (CTriad)

Encodes the sequence using a 7-class reduced amino acid alphabet and computes the frequency of all consecutive three-residue combinations (triads). The 7 classes are: (1) AGV, (2) ILFP, (3) YMTS, (4) HNQW, (5) RK, (6) DE, (7) C.

result = protpy.conjoint_triad(protein_seq)
# Shape: 1 x 343  (7 × 7 × 7 class combinations)
# 111  112  113  114  ...
# 7    17   11   3    ...

CTD Descriptors

CTD (Composition, Transition, Distribution) descriptors characterise the distribution of residues belonging to three physicochemical classes along the sequence. Seven physicochemical properties are supported.

Supported properties:

Property key

Description

Classes

hydrophobicity

Hydrophobicity

Polar / Neutral / Hydrophobic

normalized_vdwv

Normalised van der Waals volume

0–2.78 / 2.95–4.0 / 4.03–8.08

polarity

Polarity

4.9–6.2 / 8.0–9.2 / 10.4–13.0

charge

Charge

Positive / Neutral / Negative

secondary_struct

Secondary structure

Helix / Strand / Coil

solvent_accessibility

Solvent accessibility

Buried / Exposed / Intermediate

polarizability

Polarizability

0–0.108 / 0.128–0.186 / 0.219–0.409


CTD Composition

Fraction of residues in each of the three physicochemical classes.

Parameter

Type

Default

Description

property

str

"hydrophobicity"

Physicochemical property to use

result = protpy.ctd_composition(protein_seq)
result = protpy.ctd_composition(protein_seq, property="charge")
# hydrophobicity_CTD_C_01  hydrophobicity_CTD_C_02  hydrophobicity_CTD_C_03
# 0.279                    0.386                    0.335

CTD Transition

Fraction of transitions between each pair of the three physicochemical classes.

Parameter

Type

Default

Description

property

str

"hydrophobicity"

Physicochemical property to use

result = protpy.ctd_transition(protein_seq)
result = protpy.ctd_transition(protein_seq, property="polarity")
# hydrophobicity_CTD_T_12  hydrophobicity_CTD_T_13  hydrophobicity_CTD_T_23
# 0.181                    0.161                    0.179

CTD Distribution

Position of the first, 25%, 50%, 75%, and last residue of each class within the sequence (as a percentage of total length).

Parameter

Type

Default

Description

property

str

"hydrophobicity"

Physicochemical property to use

result = protpy.ctd_distribution(protein_seq)
result = protpy.ctd_distribution(protein_seq, property="secondary_struct")
# hydrophobicity_CTD_D_01_01  hydrophobicity_CTD_D_02_01  ...
# 0.0796                      0.557                       ...

CTD Combined (ctd_)

Calculate Composition, Transition and Distribution for one or all supported properties in a single call.

Parameter

Type

Default

Description

property

str

"hydrophobicity"

Property to use when all_ctd=False

all_ctd

bool

True

If True, compute CTD for all 7 supported properties

# All 7 properties (default)
result = protpy.ctd_(protein_seq)

# Single property
result = protpy.ctd_(protein_seq, property="charge", all_ctd=False)

# Shape: 1 x (3 + 3 + 15) per property  →  1 x 147 for all 7 properties
# hydrophobicity_CTD_C_01  hydrophobicity_CTD_C_02  ...
# 0.279                    0.386                    ...

Sequence Order Descriptors

Sequence order descriptors capture the effect of residue spacing along the sequence using physicochemical distance matrices. Two distance matrices are supported:

Matrix

Description

schneider-wrede

Physicochemical distance based on Schneider-Wrede scale (default)

grantham

Physicochemical distance based on Grantham’s amino acid difference formula


Sequence Order Coupling Number — single (sequence_order_coupling_number_)

Computes the sum of squared physicochemical distances between all residue pairs separated by a gap of d. Returns a single float.

Parameter

Type

Default

Description

d

int

1

Gap between residue pairs

distance_matrix

str

"schneider-wrede"

Distance matrix to use

result = protpy.sequence_order_coupling_number_(protein_seq)
result = protpy.sequence_order_coupling_number_(protein_seq, d=5, distance_matrix="grantham")
# Returns: 401.387  (float)

Sequence Order Coupling Number — series (sequence_order_coupling_number)

Calculates SOCN values across all gaps from 1 to lag.

Parameter

Type

Default

Description

lag

int

30

Maximum gap value

distance_matrix

str

"schneider-wrede"

Distance matrix to use

# Default parameters
result = protpy.sequence_order_coupling_number(protein_seq)

# Custom lag and matrix
result = protpy.sequence_order_coupling_number(protein_seq, lag=10, distance_matrix="grantham")

# Shape: 1 x lag  →  1 x 30 with defaults
# SOCN_SW1   SOCN_SW2   SOCN_SW3  ...
# 401.387    409.243    376.946   ...

Sequence Order Coupling Number — all matrices (sequence_order_coupling_number_all)

Calculates SOCN across all lags using both the Schneider-Wrede and Grantham matrices and concatenates the results.

Parameter

Type

Default

Description

lag

int

30

Maximum gap value

result = protpy.sequence_order_coupling_number_all(protein_seq)
result = protpy.sequence_order_coupling_number_all(protein_seq, lag=15)
# Shape: 1 x (2 × lag)  →  1 x 60 with defaults
# SOCN_SW1  ...  SOCN_Grant1  ...

Quasi Sequence Order (quasi_sequence_order)

Extends SOCN by combining standard amino acid composition with sequence-order coupling numbers, weighted by a factor weight. Captures both residue type and spatial distribution information.

Parameter

Type

Default

Description

lag

int

30

Maximum lag value

weight

float

0.1

Weighting factor for coupling terms

distance_matrix

str

"schneider-wrede"

Distance matrix to use

# Default parameters
result = protpy.quasi_sequence_order(protein_seq)

# Custom parameters
result = protpy.quasi_sequence_order(protein_seq, lag=10, weight=0.2, distance_matrix="grantham")

# Shape: 1 x (20 + lag)  →  1 x 50 with defaults
# QSO_SW1    QSO_SW2    QSO_SW3   ...
# 0.005692   0.002643   0.004947  ...

Quasi Sequence Order — all matrices (quasi_sequence_order_all)

Calculates Quasi Sequence Order using both distance matrices and concatenates the results.

Parameter

Type

Default

Description

lag

int

30

Maximum lag value

weight

float

0.1

Weighting factor for coupling terms

result = protpy.quasi_sequence_order_all(protein_seq)
result = protpy.quasi_sequence_order_all(protein_seq, lag=15, weight=0.05)
# Shape: 1 x (2 × (20 + lag))  →  1 x 100 with defaults
# QSO_SW1  ...  QSO_Grant1  ...

Descriptor Summary

Speed ratings reflect typical computation time for a single average-length protein (~500 residues) on a standard CPU:

Rating

Meaning

✅ Fast

< 1 ms — simple residue counting or scalar formula

🟡 Moderate

1–50 ms — sliding window or O(n²) pass

🔴 Slow

> 50 ms — large feature space, iterative convergence, or many property lookups

Note

Autocorrelation, PseAAC, and APAAComp scale with both sequence length and lag — reduce lag or the number of properties to speed them up. Tripeptide composition always produces 8000 columns regardless of sequence length.

Descriptor

Function

Output shape

Category

Speed

Complexity

Amino Acid Composition

amino_acid_composition(seq)

1 × 20

Composition

✅ Fast

O(n)

Dipeptide Composition

dipeptide_composition(seq)

1 × 400

Composition

✅ Fast

O(n)

Tripeptide Composition

tripeptide_composition(seq)

1 × 8000

Composition

🟡 Moderate

O(n)

GRAVY

gravy(seq)

1 × 1

Composition

✅ Fast

O(n)

Aromaticity

aromaticity(seq)

1 × 1

Composition

✅ Fast

O(n)

Instability Index

instability_index(seq)

1 × 1

Composition

✅ Fast

O(n)

Isoelectric Point

isoelectric_point(seq)

1 × 1

Composition

🟡 Moderate

O(n · iter)

Molecular Weight

molecular_weight(seq)

1 × 1

Composition

✅ Fast

O(n)

Charge Distribution

charge_distribution(seq, ph=7.4)

1 × 3

Composition

✅ Fast

O(n)

Hydrophobic/Polar/Charged

hydrophobic_polar_charged_composition(seq)

1 × 3

Composition

✅ Fast

O(n)

Secondary Structure Propensity

secondary_structure_propensity(seq)

1 × 3

Composition

✅ Fast

O(n)

k-mer Composition

kmer_composition(seq, k=2)

1 × 20^k

Composition

🟡 Moderate

O(n · 20^k)

Reduced Alphabet Composition

reduced_alphabet_composition(seq, alphabet_size=6)

1 × alphabet_size

Composition

✅ Fast

O(n)

Motif Composition

motif_composition(seq, motifs=None)

1 × len(motifs)

Composition

🟡 Moderate

O(n · m)

Amino Acid Pair Composition

amino_acid_pair_composition(seq)

1 × 400

Composition

✅ Fast

O(n)

Aliphatic Index

aliphatic_index(seq)

1 × 1

Composition

✅ Fast

O(n)

Extinction Coefficient

extinction_coefficient(seq)

1 × 2

Composition

✅ Fast

O(n)

Boman Index

boman_index(seq)

1 × 1

Composition

✅ Fast

O(n)

Aggregation Propensity

aggregation_propensity(seq, window=5)

1 × 2

Composition

🟡 Moderate

O(n · win)

Hydrophobic Moment

hydrophobic_moment(seq, window=11, angle=100)

1 × 2

Composition

🟡 Moderate

O(n · win)

Shannon Entropy

shannon_entropy(seq)

1 × 1

Composition

✅ Fast

O(n)

Pseudo AAComp

pseudo_amino_acid_composition(seq, lamda=30, weight=0.05)

1 × (20 + lamda)

Composition

🔴 Slow

O(n · lamda · props)

Amphiphilic Pseudo AAComp

amphiphilic_pseudo_amino_acid_composition(seq, lamda=30, weight=0.5)

1 × (20 + 2×lamda)

Composition

🔴 Slow

O(n · lamda · props)

Moreaubroto Autocorrelation

moreaubroto_autocorrelation(seq, lag=30)

1 × (lag × props)

Autocorrelation

🔴 Slow

O(n · lag · props)

Moran Autocorrelation

moran_autocorrelation(seq, lag=30)

1 × (lag × props)

Autocorrelation

🔴 Slow

O(n · lag · props)

Geary Autocorrelation

geary_autocorrelation(seq, lag=30)

1 × (lag × props)

Autocorrelation

🔴 Slow

O(n · lag · props)

Conjoint Triad

conjoint_triad(seq)

1 × 343

Conjoint Triad

✅ Fast

O(n)

CTD Composition

ctd_composition(seq, property="hydrophobicity")

1 × 3

CTD

✅ Fast

O(n)

CTD Transition

ctd_transition(seq, property="hydrophobicity")

1 × 3

CTD

✅ Fast

O(n)

CTD Distribution

ctd_distribution(seq, property="hydrophobicity")

1 × 15

CTD

✅ Fast

O(n)

CTD Combined

ctd_(seq, property="hydrophobicity", all_ctd=True)

1 × 147

CTD

🟡 Moderate

O(n · props)

SOCN (single)

sequence_order_coupling_number_(seq, d=1)

float

Sequence Order

✅ Fast

O(n)

SOCN (series)

sequence_order_coupling_number(seq, lag=30)

1 × lag

Sequence Order

🟡 Moderate

O(n · lag)

SOCN (all matrices)

sequence_order_coupling_number_all(seq, lag=30)

1 × (2 × lag)

Sequence Order

🟡 Moderate

O(n · lag)

Quasi Sequence Order

quasi_sequence_order(seq, lag=30, weight=0.1)

1 × (20 + lag)

Sequence Order

🟡 Moderate

O(n · lag)

Quasi Sequence Order (all)

quasi_sequence_order_all(seq, lag=30, weight=0.1)

1 × (2 × (20 + lag))

Sequence Order

🟡 Moderate

O(n · lag)


References

The descriptors implemented in protpy are based on the following published methods.

Composition

  • Amino acid, dipeptide, and tripeptide composition: Nakashima, H., Nishikawa, K., & Ooi, T. (1986). The folding type of a protein is relevant to the amino acid composition. Journal of Biochemistry, 99(1), 153–162.

  • GRAVY: Kyte, J., & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157(1), 105–132.

  • Aromaticity: Lobry, J. R., & Gautier, C. (1994). Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Research, 22(15), 3174–3180.

  • Instability index: Guruprasad, K., Reddy, B. V. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition. Protein Engineering, 4(2), 155–161.

  • Isoelectric point: Bjellqvist, B., et al. (1994). The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis, 14(1), 1023–1031.

  • Molecular weight, isoelectric point, charge: Gasteiger, E., et al. (2005). Protein identification and analysis tools on the ExPASy server. In The Proteomics Protocols Handbook, Humana Press, 571–607.

  • Secondary structure propensity: Chou, P. Y., & Fasman, G. D. (1974). Prediction of protein conformation. Biochemistry, 13(2), 222–245.

  • Aliphatic index: Ikai, A. J. (1980). Thermostability and aliphatic index of globular proteins. Journal of Biochemistry, 88(6), 1895–1898.

  • Extinction coefficient: Pace, C. N., et al. (1995). How to measure and predict the molar absorption coefficient of a protein. Protein Science, 4(11), 2411–2423.

  • Boman index: Boman, H. G. (2003). Antibacterial peptides: basic facts and emerging concepts. Journal of Internal Medicine, 254(3), 197–215.

  • Hydrophobic moment: Eisenberg, D., Weiss, R. M., & Terwilliger, T. C. (1982). The helical hydrophobic moment: a measure of the amphiphilicity of a helix. Nature, 299, 371–374.

  • Shannon entropy: Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.

  • Pseudo amino acid composition (PseAAC): Chou, K.-C. (2001). Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function, and Bioinformatics, 43(3), 246–255.

  • Amphiphilic PseAAC (APseAAC): Chou, K.-C. (2005). Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21(1), 10–19.

Autocorrelation

  • Moreau-Broto autocorrelation: Moreau, G., & Broto, P. (1980). The autocorrelation of a topological structure: A new molecular descriptor. Nouveau Journal de Chimie, 4, 359–360.

  • Moran autocorrelation: Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1–2), 17–23.

  • Geary autocorrelation: Geary, R. C. (1954). The contiguity ratio and statistical mapping. The Incorporated Statistician, 5(3), 115–145.

  • AAIndex properties: Kawashima, S., & Kanehisa, M. (2000). AAindex: amino acid index database. Nucleic Acids Research, 28(1), 374.

Conjoint Triad

  • Liu, B., et al. (2008). Prediction of protein-protein interactions based on the naive Bayes classifier with amino acid composition features. Biochemical and Biophysical Research Communications, 368(2), 462–468.

CTD

  • Dubchak, I., et al. (1995). Prediction of protein folding class using global description of amino acid sequence. PNAS, 92(19), 8700–8704.

  • Dubchak, I., et al. (1999). Recognition of a protein fold in the context of the SCOP classification. Proteins, 35(4), 401–407.

Sequence Order

  • Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science, 185(4154), 862–864.

  • Schneider, G., & Wrede, P. (1994). The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution. Biophysical Journal, 66(2), 335–344.

  • Chou, K.-C. (2000). Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochemical and Biophysical Research Communications, 278(2), 477–483.