Descriptor Reference ==================== A complete reference for all protein descriptors available in the **protpy** package, grouped by category. .. contents:: Contents :local: :depth: 2 ---- Setup ----- .. code-block:: console pip install protpy .. code-block:: python import protpy # Load a sequence from a FASTA file from Bio import SeqIO with open("protein.fasta") as f: protein_seq = str(next(SeqIO.parse(f, "fasta")).seq) ---- Composition Descriptors ----------------------- Amino Acid Composition (AAComp) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fraction of each of the 20 standard amino acids in the sequence. .. code-block:: python result = protpy.amino_acid_composition(protein_seq) # Shape: 1 x 20 # A C D E F ... # 6.693 3.108 5.817 3.347 6.614 ... ---- Dipeptide Composition (DPComp) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Frequency of all 400 possible dipeptide (two-residue) combinations. .. code-block:: python result = protpy.dipeptide_composition(protein_seq) # Shape: 1 x 400 # AA AC AD AE AF ... # 0.72 0.16 0.48 0.4 0.24 ... ---- Tripeptide Composition (TPComp) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Frequency of all 8000 possible tripeptide (three-residue) combinations. .. code-block:: python result = protpy.tripeptide_composition(protein_seq) # Shape: 1 x 8000 # AAA AAC AAD AAE AAF ... # 1 0 0 2 0 ... ---- Grand Average of Hydropathy (GRAVY) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Mean of the Kyte-Doolittle hydropathy values across all residues. A positive value indicates overall hydrophobicity; negative indicates overall hydrophilicity. .. code-block:: python result = protpy.gravy(protein_seq) # Shape: 1 x 1 # GRAVY # -0.045 ---- Aromaticity ~~~~~~~~~~~ Fraction of aromatic residues (F, W, Y, H) in the sequence. .. code-block:: python result = protpy.aromaticity(protein_seq) # Shape: 1 x 1 # Aromaticity # 0.118 ---- Instability Index ~~~~~~~~~~~~~~~~~ Stability classifier based on dipeptide instability weight values (DIWV). Values below 40 indicate a stable protein; 40 or above indicates instability. .. code-block:: python result = protpy.instability_index(protein_seq) # Shape: 1 x 1 # InstabilityIndex # 31.836 ---- Isoelectric Point ~~~~~~~~~~~~~~~~~ Estimated pH at which the protein carries no net charge, calculated iteratively using standard pKa values for ionisable residues. .. code-block:: python result = protpy.isoelectric_point(protein_seq) # Shape: 1 x 1 # IsoelectricPoint # 5.412 ---- Molecular Weight ~~~~~~~~~~~~~~~~ Average molecular weight of the protein calculated from residue masses, corrected for water lost at each peptide bond. .. code-block:: python result = protpy.molecular_weight(protein_seq) # Shape: 1 x 1 # MolecularWeight (Da) # 139122.355 ---- Charge Distribution ~~~~~~~~~~~~~~~~~~~~ Positive, negative, and net charge contributions of ionisable residues at a given pH using the Henderson-Hasselbalch equation. .. list-table:: :header-rows: 1 :widths: 15 10 15 60 * - Parameter - Type - Default - Description * - ``ph`` - float - ``7.4`` - pH at which to calculate charge .. code-block:: python # Default pH 7.4 result = protpy.charge_distribution(protein_seq) # Custom pH result = protpy.charge_distribution(protein_seq, ph=6.0) # Shape: 1 x 3 # PositiveCharge NegativeCharge NetCharge # 99.526 114.956 -15.43 ---- Hydrophobic/Polar/Charged Composition (HPC) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Percentage of residues belonging to each of three physicochemical groups: hydrophobic (A, C, F, I, L, M, V, W, Y), polar (G, N, Q, S, T), and charged (D, E, H, K, R). .. code-block:: python result = protpy.hydrophobic_polar_charged_composition(protein_seq) # Shape: 1 x 3 # Hydrophobic Polar Charged # 44.542 32.669 18.247 ---- Secondary Structure Propensity (SSP) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Average Chou-Fasman propensity values for alpha-helix, beta-sheet, and random coil conformations across all residues. .. code-block:: python result = protpy.secondary_structure_propensity(protein_seq) # Shape: 1 x 3 # Helix Sheet Coil # 0.983 1.05 1.043 ---- k-mer Composition ~~~~~~~~~~~~~~~~~ Frequency of all possible k-length residue subsequences, expressed as a percentage of total k-mers. .. list-table:: :header-rows: 1 :widths: 15 10 15 60 * - Parameter - Type - Default - Description * - ``k`` - int - ``2`` - Length of each k-mer .. code-block:: python # Default k=2 (dipeptide frequencies) result = protpy.kmer_composition(protein_seq) # Custom k result = protpy.kmer_composition(protein_seq, k=3) # Shape: 1 x 20^k (e.g. 1 x 400 for k=2) # AA AC AD ... # 0.797 0.159 ... ... ---- Reduced Alphabet Composition ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Amino acid composition after mapping residues to a reduced alphabet of physicochemical groups. Supported alphabet sizes: ``2``, ``3``, ``4``, ``6``. .. list-table:: :header-rows: 1 :widths: 20 10 15 55 * - Parameter - Type - Default - Description * - ``alphabet_size`` - int - ``6`` - Number of reduced groups .. code-block:: python # Default alphabet_size=6 result = protpy.reduced_alphabet_composition(protein_seq) # Custom size result = protpy.reduced_alphabet_composition(protein_seq, alphabet_size=4) # Shape: 1 x alphabet_size # Group_1 Group_2 Group_3 Group_4 Group_5 Group_6 # 25.339 34.741 9.163 9.084 10.837 10.837 ---- Motif Composition ~~~~~~~~~~~~~~~~~ Count of occurrences (including overlapping) of biological sequence motifs matched via regular expressions. Eight built-in motifs are used by default; a custom list can be supplied. **Default motifs:** .. list-table:: :header-rows: 1 :widths: 25 20 55 * - Column - Pattern - Biological meaning * - ``NxST_glycosylation`` - ``N[^P][ST]`` - N-linked glycosylation site * - ``RGD_integrin`` - ``RGD`` - Integrin-binding RGD motif * - ``KDEL_retention`` - ``KDEL`` - ER retention signal * - ``CxxC_zinc_finger`` - ``C..C`` - Zinc-finger CxxC motif * - ``CAAX_prenylation`` - ``C[A-Z]{2}[CSIM]$`` - CAAX prenylation box * - ``cAMP_PKA`` - ``[RK]{2}.[ST]`` - cAMP/PKA phosphorylation site * - ``dileucine_sorting`` - ``[DE]xxxL[LI]`` - Dileucine lysosomal sorting signal * - ``PEST_region`` - ``P.{1,10}[ED]`` - PEST degradation signal .. list-table:: :header-rows: 1 :widths: 20 10 15 55 * - Parameter - Type - Default - Description * - ``motifs`` - list or None - ``None`` - Custom regex patterns; uses built-in 8 if ``None`` .. code-block:: python # Default built-in motifs result = protpy.motif_composition(protein_seq) # Custom motif list result = protpy.motif_composition(protein_seq, motifs=[r'RGD', r'N[^P][ST]']) # Shape: 1 x len(motifs) # NxST_glycosylation RGD_integrin KDEL_retention CxxC_zinc_finger ... # 23 0 0 2 ... ---- Amino Acid Pair Composition ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Frequency of all 400 residue-pair combinations with column names annotated by the physicochemical class of each residue (Hydrophobic, Polar, Charged, or Other). .. code-block:: python result = protpy.amino_acid_pair_composition(protein_seq) # Shape: 1 x 400 # AA_Hydrophobic-Hydrophobic AA_Hydrophobic-Polar AA_Hydrophobic-Charged ... # 0.797 0.159 ... ... ---- Aliphatic Index ~~~~~~~~~~~~~~~ A measure of the relative volume occupied by aliphatic side chains (Ala, Val, Ile, Leu). Higher values indicate greater thermostability. Formula: AI = Ala% + 2.9×Val% + 3.9×(Ile%+Leu%). .. code-block:: python result = protpy.aliphatic_index(protein_seq) # Shape: 1 x 1 # AliphaticIndex # 82.725 ---- Extinction Coefficient ~~~~~~~~~~~~~~~~~~~~~~ The molar extinction coefficient at 280 nm, calculated from the number of Trp (W), Tyr (Y), and Cys (C) residues. Reported for both reduced (no disulfide bonds) and oxidized (all Cys paired) states. .. code-block:: python result = protpy.extinction_coefficient(protein_seq) # Shape: 1 x 2 # ExtCoeff_Reduced ExtCoeff_Oxidized # 140960 143335 ---- Boman Index ~~~~~~~~~~~ Sum of solubility values for amino acids divided by sequence length, predicting potential for protein–protein interactions. Positive values suggest membrane-binding or interaction potential. .. code-block:: python result = protpy.boman_index(protein_seq) # Shape: 1 x 1 # BomanIndex # 0.119 ---- Aggregation Propensity ~~~~~~~~~~~~~~~~~~~~~~ Estimates aggregation-prone regions via a sliding-window approach combining Kyte–Doolittle hydrophobicity and charge neutrality. Returns the count of qualifying windows and the fraction of the sequence covered. .. list-table:: :header-rows: 1 :widths: 25 10 15 50 * - Parameter - Type - Default - Description * - ``window`` - int - ``5`` - Sliding window size * - ``hydrophobicity_threshold`` - float - ``2.0`` - Minimum mean hydrophobicity * - ``charge_threshold`` - int - ``1`` - Maximum charged residues per window .. code-block:: python result = protpy.aggregation_propensity(protein_seq) # Shape: 1 x 2 # AggregProneRegions AggregProneFraction # 58 11.793 ---- Hydrophobic Moment ~~~~~~~~~~~~~~~~~~ The mean and maximum hydrophobic moment across sliding windows, using the Eisenberg hydrophobicity scale and a helical-wheel projection. Captures amphipathicity of putative helix segments. .. list-table:: :header-rows: 1 :widths: 15 10 15 60 * - Parameter - Type - Default - Description * - ``window`` - int - ``11`` - Sliding window size * - ``angle`` - int - ``100`` - Residue rotation angle in degrees (100° for α-helix) .. code-block:: python result = protpy.hydrophobic_moment(protein_seq) # Shape: 1 x 2 # HydrophobicMoment_Mean HydrophobicMoment_Max # 0.272 0.813 ---- Shannon Entropy ~~~~~~~~~~~~~~~ Information-theoretic measure of amino acid diversity in a sequence. Computed as: .. math:: H = -\sum_i p_i \log_2 p_i where :math:`p_i` is the fractional frequency of each amino acid type present. A value of 0 indicates a completely repetitive (single amino acid) sequence; the theoretical maximum of :math:`\log_2(20) \approx 4.322` bits corresponds to a perfectly uniform distribution across all 20 canonical amino acids. Widely used as a low-complexity filter and diversity measure in ML feature pipelines. .. code-block:: python result = protpy.shannon_entropy(protein_seq) # Shape: 1 x 1 # ShannonEntropy # 4.163 ---- Pseudo Amino Acid Composition (PAAComp) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Augmented amino acid composition that incorporates sequence-order effects via correlation factors derived from physicochemical properties. Reduces the dimensionality problem of pure sequence-order information while retaining more sequence information than simple AAComp. .. list-table:: :header-rows: 1 :widths: 15 10 15 60 * - Parameter - Type - Default - Description * - ``lamda`` - int - ``30`` - Number of sequence-order correlation factors to include * - ``weight`` - float - ``0.05`` - Weighting factor for correlation layers * - ``properties`` - list - ``[]`` - AAIndex accession numbers to use (uses built-in set if empty) .. code-block:: python # Default parameters result = protpy.pseudo_amino_acid_composition(protein_seq) # Custom parameters result = protpy.pseudo_amino_acid_composition(protein_seq, lamda=10, weight=0.1) # Shape: 1 x (20 + lamda) → 1 x 50 with defaults # PAAC_1 PAAC_2 PAAC_3 ... # 0.127 0.059 0.111 ... ---- Amphiphilic Pseudo Amino Acid Composition (APAAComp) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Extension of PAAComp that uses both hydrophobicity and hydrophilicity properties to capture amphiphilic patterns (dual hydrophobic/hydrophilic character) along the sequence. .. list-table:: :header-rows: 1 :widths: 15 10 15 60 * - Parameter - Type - Default - Description * - ``lamda`` - int - ``30`` - Number of sequence-order correlation factors * - ``weight`` - float - ``0.5`` - Weighting factor for correlation layers * - ``properties`` - list - ``[]`` - AAIndex accession numbers (defaults to hydrophobicity + hydrophilicity) .. code-block:: python # Default parameters result = protpy.amphiphilic_pseudo_amino_acid_composition(protein_seq) # Custom parameters result = protpy.amphiphilic_pseudo_amino_acid_composition(protein_seq, lamda=15, weight=0.3) # Shape: 1 x (20 + 2*lamda) → 1 x 80 with defaults # APAAC_1 APAAC_2 APAAC_3 ... # 6.624 3.076 5.757 ... ---- Autocorrelation Descriptors ---------------------------- Autocorrelation descriptors measure the correlation between physicochemical property values of residues separated by a lag distance along the sequence. By default, 8 AAIndex properties are used, generating ``lag × 8 = 240`` features. **Default properties:** .. list-table:: :header-rows: 1 :widths: 25 75 * - AAIndex ID - Property * - ``CIDH920105`` - Normalised average hydrophobicity * - ``BHAR880101`` - Average flexibility indices * - ``CHAM820101`` - Polarizability parameter * - ``CHAM820102`` - Free energy of solution in water * - ``CHOC760101`` - Residue accessible surface area in tripeptide * - ``BIGC670101`` - Residue volume * - ``CHAM810101`` - Steric parameter * - ``DAYM780201`` - Relative mutability ---- Moreaubroto Autocorrelation (MBAuto) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Uses raw property values as the basis for correlation measurement. .. list-table:: :header-rows: 1 :widths: 15 10 25 50 * - Parameter - Type - Default - Description * - ``lag`` - int - ``30`` - Maximum lag distance * - ``properties`` - list - (8 defaults above) - AAIndex accession numbers * - ``normalize`` - bool - ``True`` - Normalise output values .. code-block:: python # Default parameters result = protpy.moreaubroto_autocorrelation(protein_seq) # Custom parameters result = protpy.moreaubroto_autocorrelation(protein_seq, lag=15, properties=["CIDH920105"]) # Shape: 1 x (lag × len(properties)) → 1 x 240 with defaults # MBAuto_CIDH920105_1 MBAuto_CIDH920105_2 ... # -0.052 -0.104 ... ---- Moran Autocorrelation (MAuto) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Uses the deviation from the mean property value, making it mean-centred and thereby less sensitive to the absolute property scale. .. list-table:: :header-rows: 1 :widths: 15 10 25 50 * - Parameter - Type - Default - Description * - ``lag`` - int - ``30`` - Maximum lag distance * - ``properties`` - list - (8 defaults above) - AAIndex accession numbers * - ``normalize`` - bool - ``True`` - Normalise output values .. code-block:: python # Default parameters result = protpy.moran_autocorrelation(protein_seq) # Custom parameters result = protpy.moran_autocorrelation(protein_seq, lag=15) # Shape: 1 x (lag × len(properties)) → 1 x 240 with defaults # MAuto_CIDH920105_1 MAuto_CIDH920105_2 ... # -0.07786 -0.07879 ... ---- Geary Autocorrelation (GAuto) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Uses squared differences between property values at each lag, making it sensitive to local dissimilarities rather than global correlation. .. list-table:: :header-rows: 1 :widths: 15 10 25 50 * - Parameter - Type - Default - Description * - ``lag`` - int - ``30`` - Maximum lag distance * - ``properties`` - list - (8 defaults above) - AAIndex accession numbers * - ``normalize`` - bool - ``True`` - Normalise output values .. code-block:: python # Default parameters result = protpy.geary_autocorrelation(protein_seq) # Custom parameters result = protpy.geary_autocorrelation(protein_seq, lag=10, normalize=False) # Shape: 1 x (lag × len(properties)) → 1 x 240 with defaults # GAuto_CIDH920105_1 GAuto_CIDH920105_2 ... # 1.057 1.077 ... ---- Conjoint Triad Descriptor -------------------------- Conjoint Triad (CTriad) ~~~~~~~~~~~~~~~~~~~~~~~~ Encodes the sequence using a 7-class reduced amino acid alphabet and computes the frequency of all consecutive three-residue combinations (triads). The 7 classes are: (1) AGV, (2) ILFP, (3) YMTS, (4) HNQW, (5) RK, (6) DE, (7) C. .. code-block:: python result = protpy.conjoint_triad(protein_seq) # Shape: 1 x 343 (7 × 7 × 7 class combinations) # 111 112 113 114 ... # 7 17 11 3 ... ---- CTD Descriptors ---------------- CTD (Composition, Transition, Distribution) descriptors characterise the distribution of residues belonging to three physicochemical classes along the sequence. Seven physicochemical properties are supported. **Supported properties:** .. list-table:: :header-rows: 1 :widths: 25 30 45 * - Property key - Description - Classes * - ``hydrophobicity`` - Hydrophobicity - Polar / Neutral / Hydrophobic * - ``normalized_vdwv`` - Normalised van der Waals volume - 0–2.78 / 2.95–4.0 / 4.03–8.08 * - ``polarity`` - Polarity - 4.9–6.2 / 8.0–9.2 / 10.4–13.0 * - ``charge`` - Charge - Positive / Neutral / Negative * - ``secondary_struct`` - Secondary structure - Helix / Strand / Coil * - ``solvent_accessibility`` - Solvent accessibility - Buried / Exposed / Intermediate * - ``polarizability`` - Polarizability - 0–0.108 / 0.128–0.186 / 0.219–0.409 ---- CTD Composition ~~~~~~~~~~~~~~~ Fraction of residues in each of the three physicochemical classes. .. list-table:: :header-rows: 1 :widths: 15 10 25 50 * - Parameter - Type - Default - Description * - ``property`` - str - ``"hydrophobicity"`` - Physicochemical property to use .. code-block:: python result = protpy.ctd_composition(protein_seq) result = protpy.ctd_composition(protein_seq, property="charge") # hydrophobicity_CTD_C_01 hydrophobicity_CTD_C_02 hydrophobicity_CTD_C_03 # 0.279 0.386 0.335 ---- CTD Transition ~~~~~~~~~~~~~~ Fraction of transitions between each pair of the three physicochemical classes. .. list-table:: :header-rows: 1 :widths: 15 10 25 50 * - Parameter - Type - Default - Description * - ``property`` - str - ``"hydrophobicity"`` - Physicochemical property to use .. code-block:: python result = protpy.ctd_transition(protein_seq) result = protpy.ctd_transition(protein_seq, property="polarity") # hydrophobicity_CTD_T_12 hydrophobicity_CTD_T_13 hydrophobicity_CTD_T_23 # 0.181 0.161 0.179 ---- CTD Distribution ~~~~~~~~~~~~~~~~ Position of the first, 25%, 50%, 75%, and last residue of each class within the sequence (as a percentage of total length). .. list-table:: :header-rows: 1 :widths: 15 10 25 50 * - Parameter - Type - Default - Description * - ``property`` - str - ``"hydrophobicity"`` - Physicochemical property to use .. code-block:: python result = protpy.ctd_distribution(protein_seq) result = protpy.ctd_distribution(protein_seq, property="secondary_struct") # hydrophobicity_CTD_D_01_01 hydrophobicity_CTD_D_02_01 ... # 0.0796 0.557 ... ---- CTD Combined (``ctd_``) ~~~~~~~~~~~~~~~~~~~~~~~ Calculate Composition, Transition **and** Distribution for one or all supported properties in a single call. .. list-table:: :header-rows: 1 :widths: 15 10 15 60 * - Parameter - Type - Default - Description * - ``property`` - str - ``"hydrophobicity"`` - Property to use when ``all_ctd=False`` * - ``all_ctd`` - bool - ``True`` - If ``True``, compute CTD for all 7 supported properties .. code-block:: python # All 7 properties (default) result = protpy.ctd_(protein_seq) # Single property result = protpy.ctd_(protein_seq, property="charge", all_ctd=False) # Shape: 1 x (3 + 3 + 15) per property → 1 x 147 for all 7 properties # hydrophobicity_CTD_C_01 hydrophobicity_CTD_C_02 ... # 0.279 0.386 ... ---- Sequence Order Descriptors --------------------------- Sequence order descriptors capture the effect of residue spacing along the sequence using physicochemical distance matrices. Two distance matrices are supported: .. list-table:: :header-rows: 1 :widths: 25 75 * - Matrix - Description * - ``schneider-wrede`` - Physicochemical distance based on Schneider-Wrede scale (default) * - ``grantham`` - Physicochemical distance based on Grantham's amino acid difference formula ---- Sequence Order Coupling Number — single (``sequence_order_coupling_number_``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Computes the sum of squared physicochemical distances between all residue pairs separated by a gap of ``d``. Returns a single float. .. list-table:: :header-rows: 1 :widths: 20 10 25 45 * - Parameter - Type - Default - Description * - ``d`` - int - ``1`` - Gap between residue pairs * - ``distance_matrix`` - str - ``"schneider-wrede"`` - Distance matrix to use .. code-block:: python result = protpy.sequence_order_coupling_number_(protein_seq) result = protpy.sequence_order_coupling_number_(protein_seq, d=5, distance_matrix="grantham") # Returns: 401.387 (float) ---- Sequence Order Coupling Number — series (``sequence_order_coupling_number``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Calculates SOCN values across all gaps from 1 to ``lag``. .. list-table:: :header-rows: 1 :widths: 20 10 25 45 * - Parameter - Type - Default - Description * - ``lag`` - int - ``30`` - Maximum gap value * - ``distance_matrix`` - str - ``"schneider-wrede"`` - Distance matrix to use .. code-block:: python # Default parameters result = protpy.sequence_order_coupling_number(protein_seq) # Custom lag and matrix result = protpy.sequence_order_coupling_number(protein_seq, lag=10, distance_matrix="grantham") # Shape: 1 x lag → 1 x 30 with defaults # SOCN_SW1 SOCN_SW2 SOCN_SW3 ... # 401.387 409.243 376.946 ... ---- Sequence Order Coupling Number — all matrices (``sequence_order_coupling_number_all``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Calculates SOCN across all lags using **both** the Schneider-Wrede and Grantham matrices and concatenates the results. .. list-table:: :header-rows: 1 :widths: 15 10 15 60 * - Parameter - Type - Default - Description * - ``lag`` - int - ``30`` - Maximum gap value .. code-block:: python result = protpy.sequence_order_coupling_number_all(protein_seq) result = protpy.sequence_order_coupling_number_all(protein_seq, lag=15) # Shape: 1 x (2 × lag) → 1 x 60 with defaults # SOCN_SW1 ... SOCN_Grant1 ... ---- Quasi Sequence Order (``quasi_sequence_order``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Extends SOCN by combining standard amino acid composition with sequence-order coupling numbers, weighted by a factor ``weight``. Captures both residue type and spatial distribution information. .. list-table:: :header-rows: 1 :widths: 20 10 25 45 * - Parameter - Type - Default - Description * - ``lag`` - int - ``30`` - Maximum lag value * - ``weight`` - float - ``0.1`` - Weighting factor for coupling terms * - ``distance_matrix`` - str - ``"schneider-wrede"`` - Distance matrix to use .. code-block:: python # Default parameters result = protpy.quasi_sequence_order(protein_seq) # Custom parameters result = protpy.quasi_sequence_order(protein_seq, lag=10, weight=0.2, distance_matrix="grantham") # Shape: 1 x (20 + lag) → 1 x 50 with defaults # QSO_SW1 QSO_SW2 QSO_SW3 ... # 0.005692 0.002643 0.004947 ... ---- Quasi Sequence Order — all matrices (``quasi_sequence_order_all``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Calculates Quasi Sequence Order using **both** distance matrices and concatenates the results. .. list-table:: :header-rows: 1 :widths: 15 10 15 60 * - Parameter - Type - Default - Description * - ``lag`` - int - ``30`` - Maximum lag value * - ``weight`` - float - ``0.1`` - Weighting factor for coupling terms .. code-block:: python result = protpy.quasi_sequence_order_all(protein_seq) result = protpy.quasi_sequence_order_all(protein_seq, lag=15, weight=0.05) # Shape: 1 x (2 × (20 + lag)) → 1 x 100 with defaults # QSO_SW1 ... QSO_Grant1 ... ---- Descriptor Summary ------------------ Speed ratings reflect typical computation time for a single average-length protein (~500 residues) on a standard CPU: .. list-table:: :header-rows: 1 :widths: 10 90 * - Rating - Meaning * - ✅ Fast - < 1 ms — simple residue counting or scalar formula * - 🟡 Moderate - 1–50 ms — sliding window or O(n²) pass * - 🔴 Slow - > 50 ms — large feature space, iterative convergence, or many property lookups .. note:: Autocorrelation, PseAAC, and APAAComp scale with both sequence length **and** ``lag`` — reduce ``lag`` or the number of properties to speed them up. Tripeptide composition always produces 8000 columns regardless of sequence length. .. list-table:: :header-rows: 1 :widths: 28 35 12 15 7 8 * - Descriptor - Function - Output shape - Category - Speed - Complexity * - Amino Acid Composition - ``amino_acid_composition(seq)`` - 1 × 20 - Composition - ✅ Fast - O(n) * - Dipeptide Composition - ``dipeptide_composition(seq)`` - 1 × 400 - Composition - ✅ Fast - O(n) * - Tripeptide Composition - ``tripeptide_composition(seq)`` - 1 × 8000 - Composition - 🟡 Moderate - O(n) * - GRAVY - ``gravy(seq)`` - 1 × 1 - Composition - ✅ Fast - O(n) * - Aromaticity - ``aromaticity(seq)`` - 1 × 1 - Composition - ✅ Fast - O(n) * - Instability Index - ``instability_index(seq)`` - 1 × 1 - Composition - ✅ Fast - O(n) * - Isoelectric Point - ``isoelectric_point(seq)`` - 1 × 1 - Composition - 🟡 Moderate - O(n · iter) * - Molecular Weight - ``molecular_weight(seq)`` - 1 × 1 - Composition - ✅ Fast - O(n) * - Charge Distribution - ``charge_distribution(seq, ph=7.4)`` - 1 × 3 - Composition - ✅ Fast - O(n) * - Hydrophobic/Polar/Charged - ``hydrophobic_polar_charged_composition(seq)`` - 1 × 3 - Composition - ✅ Fast - O(n) * - Secondary Structure Propensity - ``secondary_structure_propensity(seq)`` - 1 × 3 - Composition - ✅ Fast - O(n) * - k-mer Composition - ``kmer_composition(seq, k=2)`` - 1 × 20^k - Composition - 🟡 Moderate - O(n · 20^k) * - Reduced Alphabet Composition - ``reduced_alphabet_composition(seq, alphabet_size=6)`` - 1 × alphabet_size - Composition - ✅ Fast - O(n) * - Motif Composition - ``motif_composition(seq, motifs=None)`` - 1 × len(motifs) - Composition - 🟡 Moderate - O(n · m) * - Amino Acid Pair Composition - ``amino_acid_pair_composition(seq)`` - 1 × 400 - Composition - ✅ Fast - O(n) * - Aliphatic Index - ``aliphatic_index(seq)`` - 1 × 1 - Composition - ✅ Fast - O(n) * - Extinction Coefficient - ``extinction_coefficient(seq)`` - 1 × 2 - Composition - ✅ Fast - O(n) * - Boman Index - ``boman_index(seq)`` - 1 × 1 - Composition - ✅ Fast - O(n) * - Aggregation Propensity - ``aggregation_propensity(seq, window=5)`` - 1 × 2 - Composition - 🟡 Moderate - O(n · win) * - Hydrophobic Moment - ``hydrophobic_moment(seq, window=11, angle=100)`` - 1 × 2 - Composition - 🟡 Moderate - O(n · win) * - Shannon Entropy - ``shannon_entropy(seq)`` - 1 × 1 - Composition - ✅ Fast - O(n) * - Pseudo AAComp - ``pseudo_amino_acid_composition(seq, lamda=30, weight=0.05)`` - 1 × (20 + lamda) - Composition - 🔴 Slow - O(n · lamda · props) * - Amphiphilic Pseudo AAComp - ``amphiphilic_pseudo_amino_acid_composition(seq, lamda=30, weight=0.5)`` - 1 × (20 + 2×lamda) - Composition - 🔴 Slow - O(n · lamda · props) * - Moreaubroto Autocorrelation - ``moreaubroto_autocorrelation(seq, lag=30)`` - 1 × (lag × props) - Autocorrelation - 🔴 Slow - O(n · lag · props) * - Moran Autocorrelation - ``moran_autocorrelation(seq, lag=30)`` - 1 × (lag × props) - Autocorrelation - 🔴 Slow - O(n · lag · props) * - Geary Autocorrelation - ``geary_autocorrelation(seq, lag=30)`` - 1 × (lag × props) - Autocorrelation - 🔴 Slow - O(n · lag · props) * - Conjoint Triad - ``conjoint_triad(seq)`` - 1 × 343 - Conjoint Triad - ✅ Fast - O(n) * - CTD Composition - ``ctd_composition(seq, property="hydrophobicity")`` - 1 × 3 - CTD - ✅ Fast - O(n) * - CTD Transition - ``ctd_transition(seq, property="hydrophobicity")`` - 1 × 3 - CTD - ✅ Fast - O(n) * - CTD Distribution - ``ctd_distribution(seq, property="hydrophobicity")`` - 1 × 15 - CTD - ✅ Fast - O(n) * - CTD Combined - ``ctd_(seq, property="hydrophobicity", all_ctd=True)`` - 1 × 147 - CTD - 🟡 Moderate - O(n · props) * - SOCN (single) - ``sequence_order_coupling_number_(seq, d=1)`` - float - Sequence Order - ✅ Fast - O(n) * - SOCN (series) - ``sequence_order_coupling_number(seq, lag=30)`` - 1 × lag - Sequence Order - 🟡 Moderate - O(n · lag) * - SOCN (all matrices) - ``sequence_order_coupling_number_all(seq, lag=30)`` - 1 × (2 × lag) - Sequence Order - 🟡 Moderate - O(n · lag) * - Quasi Sequence Order - ``quasi_sequence_order(seq, lag=30, weight=0.1)`` - 1 × (20 + lag) - Sequence Order - 🟡 Moderate - O(n · lag) * - Quasi Sequence Order (all) - ``quasi_sequence_order_all(seq, lag=30, weight=0.1)`` - 1 × (2 × (20 + lag)) - Sequence Order - 🟡 Moderate - O(n · lag) ---- References ---------- The descriptors implemented in **protpy** are based on the following published methods. Composition ~~~~~~~~~~~ - Amino acid, dipeptide, and tripeptide composition: Nakashima, H., Nishikawa, K., & Ooi, T. (1986). The folding type of a protein is relevant to the amino acid composition. *Journal of Biochemistry*, 99(1), 153–162. - GRAVY: Kyte, J., & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. *Journal of Molecular Biology*, 157(1), 105–132. - Aromaticity: Lobry, J. R., & Gautier, C. (1994). Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 *Escherichia coli* chromosome-encoded genes. *Nucleic Acids Research*, 22(15), 3174–3180. - Instability index: Guruprasad, K., Reddy, B. V. B., & Pandit, M. W. (1990). Correlation between stability of a protein and its dipeptide composition. *Protein Engineering*, 4(2), 155–161. - Isoelectric point: Bjellqvist, B., et al. (1994). The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. *Electrophoresis*, 14(1), 1023–1031. - Molecular weight, isoelectric point, charge: Gasteiger, E., et al. (2005). Protein identification and analysis tools on the ExPASy server. In *The Proteomics Protocols Handbook*, Humana Press, 571–607. - Secondary structure propensity: Chou, P. Y., & Fasman, G. D. (1974). Prediction of protein conformation. *Biochemistry*, 13(2), 222–245. - Aliphatic index: Ikai, A. J. (1980). Thermostability and aliphatic index of globular proteins. *Journal of Biochemistry*, 88(6), 1895–1898. - Extinction coefficient: Pace, C. N., et al. (1995). How to measure and predict the molar absorption coefficient of a protein. *Protein Science*, 4(11), 2411–2423. - Boman index: Boman, H. G. (2003). Antibacterial peptides: basic facts and emerging concepts. *Journal of Internal Medicine*, 254(3), 197–215. - Hydrophobic moment: Eisenberg, D., Weiss, R. M., & Terwilliger, T. C. (1982). The helical hydrophobic moment: a measure of the amphiphilicity of a helix. *Nature*, 299, 371–374. - Shannon entropy: Shannon, C. E. (1948). A mathematical theory of communication. *Bell System Technical Journal*, 27(3), 379–423. - Pseudo amino acid composition (PseAAC): Chou, K.-C. (2001). Prediction of protein cellular attributes using pseudo-amino acid composition. *Proteins: Structure, Function, and Bioinformatics*, 43(3), 246–255. - Amphiphilic PseAAC (APseAAC): Chou, K.-C. (2005). Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. *Bioinformatics*, 21(1), 10–19. Autocorrelation ~~~~~~~~~~~~~~~ - Moreau-Broto autocorrelation: Moreau, G., & Broto, P. (1980). The autocorrelation of a topological structure: A new molecular descriptor. *Nouveau Journal de Chimie*, 4, 359–360. - Moran autocorrelation: Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. *Biometrika*, 37(1–2), 17–23. - Geary autocorrelation: Geary, R. C. (1954). The contiguity ratio and statistical mapping. *The Incorporated Statistician*, 5(3), 115–145. - AAIndex properties: Kawashima, S., & Kanehisa, M. (2000). AAindex: amino acid index database. *Nucleic Acids Research*, 28(1), 374. Conjoint Triad ~~~~~~~~~~~~~~ - Liu, B., et al. (2008). Prediction of protein-protein interactions based on the naive Bayes classifier with amino acid composition features. *Biochemical and Biophysical Research Communications*, 368(2), 462–468. CTD ~~~ - Dubchak, I., et al. (1995). Prediction of protein folding class using global description of amino acid sequence. *PNAS*, 92(19), 8700–8704. - Dubchak, I., et al. (1999). Recognition of a protein fold in the context of the SCOP classification. *Proteins*, 35(4), 401–407. Sequence Order ~~~~~~~~~~~~~~ - Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. *Science*, 185(4154), 862–864. - Schneider, G., & Wrede, P. (1994). The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution. *Biophysical Journal*, 66(2), 335–344. - Chou, K.-C. (2000). Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. *Biochemical and Biophysical Research Communications*, 278(2), 477–483.