API Reference
All public functions are importable directly from the protpy namespace after installation.
Composition
- protpy.composition.amino_acid_composition(sequence: str) DataFrame[source]
Calculate Amino Acid Composition (AAComp) of protein sequence. AAComp describes the fraction of each amino acid type within a protein sequence, and is calculated as:
AA_Comp(s) = AA(t)/N(s)
where AA_Comp(s) is the AAComp of protein sequence s, AA(t) is the number of amino acids of type t (where t = 1,2,..,20) and N(s) is the length of the sequence [1].
:param : protein sequence. :type : sequence: str
- Returns:
pandas dataframe of AAComp for protein sequence. Dataframe will be of the shape 1 x 20, where 20 is the number of features calculated from the descriptor (for the 20 amino acids) and 1 is the input sequence.
- Return type:
amino_acid_composition_df: pd.DataFrame
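The AAComp formula above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the library's code: the helper name `aa_fraction` is hypothetical, and it returns a dict of fractions rather than a DataFrame.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def aa_fraction(sequence: str) -> dict:
    """Fraction of each canonical amino acid: AA(t) / N(s). Hypothetical helper."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    n = len(seq)
    # Residues absent from the sequence get a fraction of 0.0.
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}


composition = aa_fraction("ACDA")
print(composition["A"])  # 0.5
```

The result always has 20 entries, matching the 1 x 20 shape described above.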
- protpy.composition.dipeptide_composition(sequence: str) DataFrame[source]
Calculate Dipeptide Composition (DPComp) for protein sequence. Dipeptide composition is the fraction of each dipeptide type within a protein sequence. With dipeptides being of length 2 and there being 20 canonical amino acids this creates 20^2 different combinations, thus a 400-Dimensional vector will be produced such that:
DPComp(s,t) = AA(s,t) / (N - 1)
where DPComp(s,t) is the dipeptide composition of the protein sequence for amino acid types s and t (where s and t = 1,2,..,20), AA(s,t) is the number of dipeptides formed by amino acid types s and t, and N is the length of the sequence, so that N - 1 is the total number of dipeptides [1].
:param : protein sequence. :type : sequence: str
- Returns:
pandas dataframe of dipeptide composition for protein sequence. Dataframe will be of the shape 1 x 400, where 400 is the number of features calculated from the descriptor (20^2 for the 20 canonical amino acids) and 1 is the input sequence.
- Return type:
dipeptide_composition_df: pd.DataFrame
- protpy.composition.tripeptide_composition(sequence: str) DataFrame[source]
Calculate Tripeptide Composition (TPComp) of protein sequence. Tripeptide composition is the fraction of each tripeptide type within a protein sequence. With tripeptides being of length 3 and there being 20 canonical amino acids, this creates 20^3 different combinations, thus an 8000-dimensional vector will be produced such that:
TPComp(s,t,u) = AA(s,t,u) / (N - 2)
where TPComp(s,t,u) is the tripeptide composition of the protein sequence for amino acid types s, t and u (where s, t and u = 1,2,..,20), AA(s,t,u) is the number of tripeptides formed by amino acid types s, t and u, and N is the length of the sequence, so that N - 2 is the total number of tripeptides [1].
:param : protein sequence in str form. :type : sequence: str
- Returns:
pandas DataFrame of tripeptide composition for protein sequence. Dataframe will be of the shape 1 x 8000, where 8000 is the number of features calculated from the descriptor (20^3 for the 20 canonical amino acids) and 1 is the input sequence.
- Return type:
tripeptide_composition_df: pd.DataFrame
- protpy.composition.gravy(sequence: str) DataFrame[source]
Calculate the Grand Average of Hydropathy (GRAVY) for a protein sequence. GRAVY is the sum of the Kyte-Doolittle hydropathy values of all residues divided by the sequence length [8]:
GRAVY = (sum of hydropathy values) / N
A positive GRAVY value indicates an overall hydrophobic protein; a negative value indicates an overall hydrophilic protein.
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame containing the GRAVY score for the input sequence. Dataframe will be of the shape 1 x 1.
- Return type:
gravy_df: pd.DataFrame
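As a rough illustration of the GRAVY formula, the sketch below averages Kyte-Doolittle hydropathy values over a sequence. The function name `gravy_sketch` is hypothetical, and the library may round or validate input differently.

```python
# Kyte-Doolittle hydropathy values for the 20 canonical amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy_sketch(sequence: str) -> float:
    """Mean hydropathy: (sum of hydropathy values) / N. Hypothetical helper."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    # A non-canonical residue raises KeyError in this sketch.
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


hydrophobic_score = gravy_sketch("AI")   # positive: hydrophobic overall
hydrophilic_score = gravy_sketch("RK")   # negative: hydrophilic overall
```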
- protpy.composition.aromaticity(sequence: str) DataFrame[source]
Calculate Aromaticity of a protein sequence. Aromaticity is the fraction of aromatic amino acids (Phe, Trp, Tyr) in the sequence [9]:
Aromaticity = (F + W + Y) / N
where F, W and Y are the counts of Phe, Trp and Tyr residues, and N is the total sequence length.
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 1 containing the aromaticity score.
- Return type:
aromaticity_df: pd.DataFrame
- protpy.composition.instability_index(sequence: str) DataFrame[source]
Calculate the Instability Index (II) of a protein sequence. The instability index is a measure of in vivo stability, computed as a weighted sum of dipeptide instability weight values (DIWV) [10]:
II = (10 / N) * sum(DIWV(x_i, x_{i+1})) for i in 1..N-1
Proteins with II < 40 are predicted to be stable; II >= 40 indicates instability.
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 1 containing the instability index.
- Return type:
instability_index_df: pd.DataFrame
- protpy.composition.isoelectric_point(sequence: str) DataFrame[source]
Calculate the theoretical Isoelectric Point (pI) of a protein sequence. pI is the pH at which the net charge of the protein is zero. It is computed by iteratively adjusting pH until positive and negative charges balance [11].
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 1 containing the pI value.
- Return type:
isoelectric_point_df: pd.DataFrame
- protpy.composition.molecular_weight(sequence: str) DataFrame[source]
Calculate the Molecular Weight (MW) of a protein sequence. MW is computed as the sum of average residue masses minus one water molecule per peptide bond (18.015 Da) [12]:
MW = sum(residue_mass_i) - (N - 1) * 18.015
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 1 containing the molecular weight (Da).
- Return type:
molecular_weight_df: pd.DataFrame
- protpy.composition.charge_distribution(sequence: str, ph: float = 7.4) DataFrame[source]
Calculate Charge Distribution of a protein sequence at a given pH. Positive charge is contributed by Lys (K), Arg (R) and His (H); negative charge by Asp (D) and Glu (E). Henderson-Hasselbalch equations are used [13]:
positive = sum(count(aa) / (1 + 10^(pH - pKa))) for aa in {K, R, H}
negative = sum(count(aa) / (1 + 10^(pKa - pH))) for aa in {D, E}
net = positive - negative
:param : protein sequence. :type : sequence: str :param : pH at which to evaluate charges. :type : ph: float (default=7.4)
- Returns:
pandas DataFrame of shape 1 x 3 with columns [“PositiveCharge”, “NegativeCharge”, “NetCharge”].
- Return type:
charge_distribution_df: pd.DataFrame
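The Henderson-Hasselbalch terms above can be sketched as follows. The pKa values are illustrative textbook-style side-chain values chosen for this example; the table the library actually uses is not stated here and may differ. The name `charge_sketch` is hypothetical.

```python
# ASSUMPTION: illustrative side-chain pKa values, not taken from the library.
SIDE_CHAIN_PKA = {"K": 10.5, "R": 12.5, "H": 6.0, "D": 3.9, "E": 4.1}


def charge_sketch(sequence: str, ph: float = 7.4):
    """Return (positive, negative, net) charge estimates at the given pH."""
    seq = sequence.upper()
    # Basic residues carry a positive charge when protonated.
    positive = sum(
        seq.count(aa) / (1 + 10 ** (ph - SIDE_CHAIN_PKA[aa])) for aa in "KRH"
    )
    # Acidic residues carry a negative charge when deprotonated.
    negative = sum(
        seq.count(aa) / (1 + 10 ** (SIDE_CHAIN_PKA[aa] - ph)) for aa in "DE"
    )
    return positive, negative, positive - negative


# At pH equal to its pKa, a single lysine is half protonated.
pos, neg, net = charge_sketch("K", ph=10.5)
```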
- protpy.composition.hydrophobic_polar_charged_composition(sequence: str) DataFrame[source]
Calculate Hydrophobic, Polar and Charged Composition of a protein sequence. Residues are grouped into three physicochemical classes [1]:
Hydrophobic (nonpolar): A, C, F, I, L, M, V, W, Y
Polar (uncharged): G, N, Q, S, T
Charged: D, E, R, H, K
Each value is expressed as a percentage of the total sequence length.
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 3 with columns [“Hydrophobic”, “Polar”, “Charged”].
- Return type:
hpc_df: pd.DataFrame
- protpy.composition.secondary_structure_propensity(sequence: str) DataFrame[source]
Calculate Secondary Structure Propensity of a protein sequence. Each residue is assigned Chou-Fasman propensity values for alpha-helix (H), beta-sheet (E) and coil/loop (C) conformations [14]. The mean propensity for each secondary structure class across the whole sequence is returned.
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 3 with columns [“Helix”, “Sheet”, “Coil”].
- Return type:
ssp_df: pd.DataFrame
- protpy.composition.kmer_composition(sequence: str, k: int = 2) DataFrame[source]
Calculate k-mer Composition of a protein sequence. A k-mer is a subsequence of length k. For each of the 20^k possible k-mers the frequency relative to the total number of k-mers in the sequence is calculated [1].
:param : protein sequence. :type : sequence: str :param : length of k-mer subsequences (k >= 1). :type : k: int (default=2)
- Returns:
pandas DataFrame of shape 1 x 20^k containing fractional k-mer frequencies.
- Return type:
kmer_composition_df: pd.DataFrame
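A minimal sketch of k-mer counting with a sliding window, assuming overlapping k-mers and a denominator of len(sequence) - k + 1. The helper name `kmer_fraction` is hypothetical and the library's handling of non-canonical residues may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_fraction(sequence: str, k: int = 2) -> dict:
    """Fractional frequency of every one of the 20^k possible k-mers."""
    seq = sequence.upper()
    total = len(seq) - k + 1  # number of overlapping windows of length k
    if k < 1 or total < 1:
        raise ValueError("need k >= 1 and len(sequence) >= k")
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with non-canonical residues are skipped
            counts[kmer] += 1
    return {kmer: c / total for kmer, c in counts.items()}


dimers = kmer_fraction("ACAC", k=2)    # windows: AC, CA, AC
trimers = kmer_fraction("ACAC", k=3)   # windows: ACA, CAC
```

Note that the output grows as 20^k, so k = 3 already yields 8000 features.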
- protpy.composition.reduced_alphabet_composition(sequence: str, alphabet_size: int = 6) DataFrame[source]
Calculate Reduced Alphabet Composition of a protein sequence. The 20 standard amino acids are clustered into a smaller alphabet of physicochemically similar groups. The fraction of residues belonging to each group is returned [1].
Supported alphabet sizes: 2, 3, 4, 6 (default=6). Invalid sizes are reset to 6.
:param : protein sequence. :type : sequence: str :param : number of amino acid groups (2, 3, 4 or 6). :type : alphabet_size: int (default=6)
- Returns:
pandas DataFrame of shape 1 x alphabet_size.
- Return type:
reduced_alphabet_df: pd.DataFrame
- protpy.composition.motif_composition(sequence: str, motifs: dict[str, str] | None = None) DataFrame[source]
Calculate Motif Composition of a protein sequence. For each motif in the provided dictionary, the number of regex matches within the sequence is counted [15]. If no motifs are supplied, a set of biologically relevant default motifs is used.
:param : protein sequence. :type : sequence: str :param : dictionary mapping motif names (str) to regex patterns (str). If None, the built-in default_motifs dict is used. :type : motifs: dict or None (default=None)
- Returns:
pandas DataFrame of shape 1 x M, where M is the number of motifs. Each value is the integer count of matches for that motif.
- Return type:
motif_composition_df: pd.DataFrame
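Regex-based motif counting can be sketched with the standard library. The motif names and patterns below are illustrative choices for this example only and are not the library's built-in default motifs; `count_motifs` is a hypothetical helper.

```python
import re


def count_motifs(sequence: str, motifs: dict) -> dict:
    """Count non-overlapping regex matches for each named motif."""
    seq = sequence.upper()
    return {name: len(re.findall(pattern, seq)) for name, pattern in motifs.items()}


# ASSUMPTION: example patterns, not the library defaults.
example_motifs = {
    "N_glycosylation": r"N[^P][ST][^P]",  # N, not P, S or T, not P
    "RGD": r"RGD",                        # cell-attachment tripeptide
}

motif_counts = count_motifs("MNGSARGDKRGD", example_motifs)
print(motif_counts)  # {'N_glycosylation': 1, 'RGD': 2}
```

Because `re.findall` returns non-overlapping matches, overlapping occurrences of a motif are counted once in this sketch.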
- protpy.composition.amino_acid_pair_composition(sequence: str) DataFrame[source]
Calculate Amino Acid Pair Composition (PairComp) of a protein sequence. For each pair of consecutive residues (i, i+1), the fractional frequency of all 400 possible dipeptide types (20 x 20) is computed [1]:
PairComp(s,t) = count(s,t) / (N - 1) * 100
Unlike the standard dipeptide composition, results are annotated with the physicochemical class of each residue in the pair (Hydrophobic/Polar/Charged).
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 400, where each column label has the form ‘XY_Class1-Class2’ (e.g. ‘AL_Hydrophobic-Hydrophobic’).
- Return type:
pair_composition_df: pd.DataFrame
- protpy.composition.aliphatic_index(sequence: str) DataFrame[source]
Calculate the Aliphatic Index (AI) of a protein sequence. The aliphatic index is a measure of thermostability, defined as the relative volume occupied by aliphatic side chains (Ala, Val, Ile, Leu) [16]:
AI = Xala + 2.9 * Xval + 3.9 * (Xile + Xleu)
where X values are mole percentages of each residue. Higher values indicate greater thermostability.
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 1 containing the aliphatic index score.
- Return type:
aliphatic_index_df: pd.DataFrame
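The aliphatic index formula translates directly into code. The sketch below is an illustration only; `aliphatic_index_sketch` is a hypothetical name and no rounding is applied.

```python
def aliphatic_index_sketch(sequence: str) -> float:
    """AI = Xala + 2.9 * Xval + 3.9 * (Xile + Xleu), with X in mole percent."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / n

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


all_alanine = aliphatic_index_sketch("AAAA")  # 100% Ala gives 100.0
mixed = aliphatic_index_sketch("AVIL")        # 25 + 72.5 + 195 = 292.5
```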
- protpy.composition.extinction_coefficient(sequence: str) DataFrame[source]
Calculate the molar Extinction Coefficient of a protein sequence at 280 nm. The coefficient is derived from the number of Trp (W), Tyr (Y) and Cys (C) residues using the method of Gasteiger et al. (2005) [17]:
EC = (W * 5500) + (Y * 1490) + (SS * 125)
where SS is the number of disulfide bonds (Cys_count // 2) for the oxidised form, and 0 for the reduced form. Both are returned.
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 2 with columns ExtCoeff_Reduced and ExtCoeff_Oxidized (M⁻¹cm⁻¹).
- Return type:
extinction_coefficient_df: pd.DataFrame
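A small sketch of the extinction coefficient formula, returning the reduced and oxidised values as a tuple. The helper name `extinction_coefficients` is hypothetical.

```python
def extinction_coefficients(sequence: str):
    """Return (reduced, oxidised) molar extinction coefficients at 280 nm."""
    seq = sequence.upper()
    base = seq.count("W") * 5500 + seq.count("Y") * 1490
    # Oxidised form: every pair of cysteines is assumed to form one disulfide.
    disulfides = seq.count("C") // 2
    return base, base + disulfides * 125


reduced, oxidised = extinction_coefficients("WYCC")
print(reduced, oxidised)  # 6990 7115
```

With an odd number of cysteines, the unpaired residue contributes nothing, so "WYC" gives the same value for both forms.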
- protpy.composition.boman_index(sequence: str) DataFrame[source]
Calculate the Boman Index (potential protein interaction index) for a protein sequence. The index is the sum of the solubility values of all residues divided by the sequence length [18]:
BI = sum(solubility(aa_i)) / N
Higher values indicate a greater tendency to bind other proteins or cell membranes (e.g. antimicrobial peptides). Values >2.48 suggest high interaction potential.
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 1 containing the Boman Index score.
- Return type:
boman_index_df: pd.DataFrame
- protpy.composition.aggregation_propensity(sequence: str, window: int = 5, hydrophobicity_threshold: float = 2.0, charge_threshold: int = 1) DataFrame[source]
Calculate the Aggregation Propensity of a protein sequence. Aggregation-prone regions (APRs) are identified using a sliding window: a window is classified as an APR when its mean Kyte-Doolittle hydrophobicity exceeds hydrophobicity_threshold and the number of charged residues (D, E, K, R) is below charge_threshold [8].
:param : protein sequence. :type : sequence: str :param : sliding window length in residues. :type : window: int (default=5) :param : minimum mean Kyte-Doolittle hydrophobicity for an APR window. :type : hydrophobicity_threshold: float (default=2.0) :param : maximum number of charged residues (D, E, K, R) allowed in an APR. :type : charge_threshold: int (default=1)
- Returns:
pandas DataFrame of shape 1 x 2 with columns AggregProneRegions (number of non-overlapping-scored APR windows) and AggregProneFraction (percentage of residues covered by any APR).
- Return type:
aggregation_propensity_df: pd.DataFrame
- protpy.composition.shannon_entropy(sequence: str) DataFrame[source]
Calculate the Shannon Entropy of a protein sequence. Shannon entropy quantifies the diversity of the amino acid composition using the information-theoretic formula:
H = -sum_i ( p_i * log2(p_i) )
where p_i is the fractional frequency of amino acid i in the sequence. A value near zero indicates a low-complexity or repetitive sequence; the theoretical maximum is log2(20) ≈ 4.322 bits for a perfectly uniform distribution across all 20 canonical amino acids. Shannon entropy is widely used as a sequence quality filter and complexity measure in machine-learning pipelines [1].
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of shape 1 x 1 with column ShannonEntropy. Value is rounded to 3 decimal places.
- Return type:
shannon_entropy_df: pd.DataFrame
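The entropy formula can be checked with a few lines of standard-library Python. This is a sketch, not the library's implementation; `shannon_entropy_bits` is a hypothetical name and no rounding is applied.

```python
import math
from collections import Counter


def shannon_entropy_bits(sequence: str) -> float:
    """H = -sum(p_i * log2(p_i)) over the residues present in the sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    # p * log2(1/p) equals -p * log2(p) and avoids returning -0.0.
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


repetitive = shannon_entropy_bits("AAAA")                 # 0.0
two_letters = shannon_entropy_bits("AACC")                # 1.0
uniform = shannon_entropy_bits("ACDEFGHIKLMNPQRSTVWY")    # log2(20)
```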
- protpy.composition.hydrophobic_moment(sequence: str, window: int = 11, angle: float = 100) DataFrame[source]
Calculate the mean and maximum Hydrophobic Moment of a protein sequence. The hydrophobic moment measures the amphiphilicity of a helical segment — the directional asymmetry of hydrophobicity projected around a helical axis [19].
For each window of length window, the moment is:
mu = sqrt( (sum H_i * sin(i * angle))^2 + (sum H_i * cos(i * angle))^2 ) / window
where H_i is the Eisenberg hydrophobicity of residue i and angle is in degrees. Default angle of 100° corresponds to an alpha-helix; 160° is commonly used for beta-strands.
:param : protein sequence. :type : sequence: str :param : sliding window length in residues. :type : window: int (default=11) :param : residue rotation angle in degrees (100° = alpha-helix). :type : angle: float (default=100)
- Returns:
pandas DataFrame of shape 1 x 2 with columns HydrophobicMoment_Mean and HydrophobicMoment_Max.
- Return type:
hydrophobic_moment_df: pd.DataFrame
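The windowed moment can be sketched as below. The hydrophobicity values are copied from the first default property dictionary shown in the amphiphilic_pseudo_amino_acid_composition signature in this reference; whether this function uses exactly that table is an assumption, and the helper names are hypothetical.

```python
import math

# ASSUMPTION: hydrophobicity scale taken from the default properties listed
# for amphiphilic_pseudo_amino_acid_composition.
HYDROPHOBICITY = {
    "A": 0.62, "C": 0.29, "D": -0.9, "E": -0.74, "F": 1.19, "G": 0.48,
    "H": -0.4, "I": 1.38, "K": -1.5, "L": 1.06, "M": 0.64, "N": -0.78,
    "P": 0.12, "Q": -0.85, "R": -2.53, "S": -0.18, "T": -0.05, "V": 1.08,
    "W": 0.81, "Y": 0.26,
}


def window_moment(segment: str, angle: float = 100.0) -> float:
    """Moment of one window; angle is in degrees and converted to radians."""
    rad = math.radians(angle)
    sin_sum = sum(HYDROPHOBICITY[aa] * math.sin(i * rad) for i, aa in enumerate(segment))
    cos_sum = sum(HYDROPHOBICITY[aa] * math.cos(i * rad) for i, aa in enumerate(segment))
    return math.hypot(sin_sum, cos_sum) / len(segment)


def hydrophobic_moment_sketch(sequence: str, window: int = 11, angle: float = 100.0):
    """Return (mean, max) of the moment over all sliding windows."""
    seq = sequence.upper()
    if window < 1 or len(seq) < window:
        raise ValueError("sequence must be at least one window long")
    moments = [
        window_moment(seq[start:start + window], angle)
        for start in range(len(seq) - window + 1)
    ]
    return sum(moments) / len(moments), max(moments)


mean_mu, max_mu = hydrophobic_moment_sketch("ALKALKALKALKA")
```

Starting the residue index at 0 or 1 only rotates the summed vector, so the magnitude is the same either way.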
- protpy.composition.pseudo_amino_acid_composition(sequence: str, lamda: int = 30, weight: float = 0.05, properties: list[str] | None = None) DataFrame[source]
Calculate the Pseudo Amino Acid Composition (PAAComp) of a protein sequence. PAAComp combines the plain amino acid composition descriptor with additional local features, such as the correlation between residues a certain distance apart, because amino acid composition alone does not take sequence-order information into account. The pseudo components of the descriptor are a series of rank-different correlation factors [4, 5].
The first 20 components are a weighted form of the amino acid composition, and the remaining lamda components (30 by default) are physicochemical square correlations, as dictated by the lamda and properties parameters. This generates an output of shape 1 x (20 + lamda), i.e. 1 x 50 when using the default lamda of 30. By default, the physicochemical properties used are hydrophobicity and hydrophilicity, with a lamda of 30 and a weight of 0.05.
:param : rank of correlation. Number of calculable descriptors depends on lamda value. :type : lamda: int :param : weighting factor. :type : weight: float :param : list of dicts of physicochemical/structural property values for amino acids. :type : properties: list
- Returns:
output dataframe of calculated pseudo amino acid composition descriptors for input sequence. Dataframe will be of the shape 1 x N, where N is the number of features calculated from the descriptor (20 + lamda) and 1 is the input sequence. By default, the shape will be 1 x 50 (using default lamda=30).
- Return type:
pseudo_amino_acid_composition_df: pd.DataFrame
- protpy.composition.sequence_order_correlation_factor(sequence: str, k: int = 1, properties: list[dict[str, float]] | None = None) float[source]
Calculate the sequence order correlation factor, with gap equal to k, for a protein sequence based on the given input properties.
:param : gap between amino acids in the sequence. :type : k: int :param : list of dicts of physicochemical/structural property values for amino acids. :type : properties: list
- Returns:
correlation factor value for the sequence using the correlation function for the particular sequence.
- Return type:
result: float
- protpy.composition.correlation_function(Ri: str = 'S', Rj: str = 'D', properties: list[dict[str, float]] | None = None) float[source]
Calculate the correlation between 2 input amino acids based on the physicochemical/structural properties from the protein sequence.
:param : 1st amino acid. :type : Ri: str :param : 2nd amino acid. :type : Rj: str :param : list of dicts of physicochemical/structural property values for amino acids. :type : properties: list
- Returns:
correlation value for the two input amino acids based on input properties.
- Return type:
correlation: float
- protpy.composition.normalize_property(properties: dict[str, float]) dict[str, float][source]
Normalize physicochemical/structural property values using their mean and standard deviation.
:param : dictionary of amino acid values for a physicochemical property. :type : properties: dict
- Returns:
dict of normalized property values.
- Return type:
normalized_vals: dict
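Mean and standard deviation normalization can be sketched with the statistics module. The sketch assumes the population standard deviation; whether the library uses the population or sample form is not stated here. `normalize_sketch` is a hypothetical name.

```python
import statistics


def normalize_sketch(properties: dict) -> dict:
    """Z-score each property value: (value - mean) / standard deviation."""
    values = list(properties.values())
    mean = statistics.fmean(values)
    # ASSUMPTION: population standard deviation (divide by n, not n - 1).
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("property values must not all be identical")
    return {aa: (value - mean) / std for aa, value in properties.items()}


normalized = normalize_sketch({"A": 1.0, "C": 3.0})
print(normalized)  # {'A': -1.0, 'C': 1.0}
```

After normalization the values have zero mean and unit standard deviation, which puts different physicochemical scales on a comparable footing.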
- protpy.composition.amphiphilic_pseudo_amino_acid_composition(sequence: str, lamda: int = 30, weight: float = 0.5, properties: list[dict[str, float]] = [{'A': 0.62, 'C': 0.29, 'D': -0.9, 'E': -0.74, 'F': 1.19, 'G': 0.48, 'H': -0.4, 'I': 1.38, 'K': -1.5, 'L': 1.06, 'M': 0.64, 'N': -0.78, 'P': 0.12, 'Q': -0.85, 'R': -2.53, 'S': -0.18, 'T': -0.05, 'V': 1.08, 'W': 0.81, 'Y': 0.26}, {'A': -0.5, 'C': -1.0, 'D': 3.0, 'E': 3.0, 'F': -2.5, 'G': 0.0, 'H': -0.5, 'I': -1.8, 'K': 3.0, 'L': -1.8, 'M': -1.3, 'N': 0.2, 'P': 0.0, 'Q': 0.2, 'R': 3.0, 'S': 0.3, 'T': -0.4, 'V': -1.5, 'W': -3.4, 'Y': -2.3}]) DataFrame[source]
Amphiphilic pseudo amino acid composition (APAAComp) has the same form as the amino acid composition, but contains much more information that is related to the sequence order of a protein and the distribution of the hydrophobic and hydrophilic amino acids along its chain [5].
The first 20 numbers in the descriptor are the components of the conventional amino acid composition; the next 2*λ numbers are a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain.
:param : protein sequence. :type : sequence: str :param : rank of correlation. Number of calculable descriptors depends on lambda value. :type : lamda: int :param : weighting factor. :type : weight: float :param : list of dicts of physicochemical/structural property values for amino acids. :type : properties: list (default=[hydrophobicity, hydrophilicity])
- Returns:
output dataframe of calculated amphiphilic pseudo amino acid composition descriptors for input sequence. Dataframe will be of the shape 1 x N, where N is the number of features calculated from the descriptor (20 + 2*lambda) and 1 is the input sequence. By default, the shape will be 1 x 80 (using default lamda=30).
- Return type:
amp_pseudo_amino_acid_composition_df: pd.DataFrame
- protpy.composition.amphiphilic_sequence_order_correlation_factor(sequence: str, k: int = 1) list[float][source]
Calculate the amphiphilic sequence order correlation factor for a sequence with gap=k.
:param : protein sequence. :type : sequence: str :param : gap between amino acids in the sequence. :type : k: int (default=1)
- Returns:
list of correlation factors for both hydrophobicity and hydrophilicity.
- Return type:
correlation_factor: list
- protpy.composition.amphiphilic_correllation_function(Ri: str = 'S', Rj: str = 'D') tuple[float, float][source]
Calculate correlation values based on the hydrophobicity and hydrophilicity property values for input amino acids Ri and Rj in the sequence.
:param : 1st amino acid. :type : Ri: str :param : 2nd amino acid. :type : Rj: str
- Returns:
correlation values for input property values.
- Return type:
theta1, theta2: float
Autocorrelation
- protpy.autocorrelation.moreaubroto_autocorrelation(sequence: str, lag: int = 30, properties: list[str] = ['CIDH920105', 'BHAR880101', 'CHAM820101', 'CHAM820102', 'CHOC760101', 'BIGC670101', 'CHAM810101', 'DAYM780201'], normalize: bool = True) DataFrame[source]
Calculate the Moreau-Broto Autocorrelation (MBAuto) descriptor for a sequence. Autocorrelation descriptors are a class of topological descriptors, also known as molecular connectivity indices, that describe the level of correlation between two objects (protein or peptide sequences) in terms of their specific structural or physicochemical properties, which are defined based on the distribution of amino acid properties along the sequence. By default, 8 amino acid properties are used for deriving the descriptors. The derivations and detailed explanations of this type of descriptor are outlined in [1].
The MBAuto descriptor is a type of autocorrelation descriptor that uses the property values themselves as the basis for measurement [2]. The number of features generated depends on the lag value and the number of input properties: total features = lag * number of properties. The output can also be normalized by setting the normalize parameter to True, which is the default. Using the default 8 properties with the default lag value of 30, 240 features are generated. The default 8 properties are:
AccNo. CIDH920105 - Normalized Average Hydrophobicity Scales
AccNo. BHAR880101 - Average Flexibility Indices
AccNo. CHAM820101 - Polarizability Parameter
AccNo. CHAM820102 - Free Energy of Solution in Water, kcal/mole
AccNo. CHOC760101 - Residue Accessible Surface Area in Tripeptide
AccNo. BIGC670101 - Residue Volume
AccNo. CHAM810101 - Steric Parameter
AccNo. DAYM780201 - Relative Mutability
:param : protein sequence. :type : sequence: str :param : A value for a lag, the max value is equal to the length of the shortest peptide minus one. :type : lag: int (default=30) :param : list of AAI index record codes/accession numbers for the physicochemical properties to use in the calculation. Defaults to the 8 standard AAIndex properties listed above. :type : properties: list :param : rescale/normalize MoreauBroto Autocorrelation values into range of 0-1. :type : normalize: bool (default=True)
- Returns:
pandas Dataframe of MBAuto values for protein sequence. Output will be of the shape 1 x N, where N is the number of features calculated from the descriptor and 1 is the input sequence. By default, the shape will be 1 x 240 (30 features per property - using 8 properties and lag=30).
- Return type:
moreaubroto_autocorrelation_df: pd.DataFrame
- protpy.autocorrelation.moran_autocorrelation(sequence: str, lag: int = 30, properties: list[str] = ['CIDH920105', 'BHAR880101', 'CHAM820101', 'CHAM820102', 'CHOC760101', 'BIGC670101', 'CHAM810101', 'DAYM780201'], normalize: bool = True) DataFrame[source]
Refer to MBAuto docstring for autocorrelation explanation. Moran autocorrelation (MAuto) utilizes property deviations from the average values [2].
:param : protein sequence. :type : sequence: str :param : A value for a lag, the max value is equal to the length of the shortest peptide minus one. :type : lag: int (default=30) :param : list of AAI index record codes/accession numbers for the physicochemical properties to use in the calculation. Defaults to the same 8 standard AAIndex properties as MBAuto. :type : properties: list :param : rescale/normalize Moran Autocorrelation values into range of 0-1. :type : normalize: bool (default=True)
- Returns:
pandas Dataframe of MAuto values for protein sequence. Output will be of the shape 1 x N, where N is the number of features calculated from the descriptor and 1 is the input sequence. By default, the shape will be 1 x 240 (30 features per property - using 8 properties and lag=30).
- Return type:
moran_autocorrelation_df: pd.DataFrame
- protpy.autocorrelation.geary_autocorrelation(sequence: str, lag: int = 30, properties: list[str] = ['CIDH920105', 'BHAR880101', 'CHAM820101', 'CHAM820102', 'CHOC760101', 'BIGC670101', 'CHAM810101', 'DAYM780201'], normalize: bool = True) DataFrame[source]
Refer to MBAuto docstring for autocorrelation description. Geary Autocorrelation (GAuto) utilizes the square-difference of property values instead of vector-products (of property values or deviations) [2].
:param : protein sequence. :type : sequence: str :param : A value for a lag, the max value is equal to the length of the shortest peptide minus one. :type : lag: int (default=30) :param : list of AAI index record codes/accession numbers for the physicochemical properties to use in the calculation. Defaults to the same 8 standard AAIndex properties as MBAuto. :type : properties: list :param : rescale/normalize Geary Autocorrelation values into range of 0-1. :type : normalize: bool (default=True)
- Returns:
pandas DataFrame of GAuto values for protein sequence. Output will be of the shape 1 x N, where N is the number of features calculated from the descriptor and 1 is the input sequence. By default, the shape will be 1 x 240 (30 features per property - using 8 properties and lag=30).
- Return type:
geary_autocorrelation_df: pd.DataFrame
Conjoint Triad
- protpy.conjoint_triad.conjoint_triad(sequence: str) DataFrame[source]
Calculate Conjoint Triad features (CTriad) for a protein sequence. The descriptor mainly considers neighbor relationships in protein sequences by encoding each protein sequence using the frequency distribution of triads (three contiguous amino acids) over a 7-letter reduced alphabet [1]. This descriptor calculates 343 different features (7x7x7), with the output being of shape 1 x 343 for a sequence.
:param : protein sequence. :type : sequence: str
- Returns:
pandas DataFrame of CTriad descriptor values for the protein sequence. DataFrame will be of the shape 1 x 343, where 343 is the number of features calculated from the descriptor and 1 is the input sequence.
- Return type:
conjoint_triad_df: pd.DataFrame
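The triad encoding can be sketched as follows. The seven-class grouping used here is a commonly cited one for conjoint triads and is an assumption; the library's own mapping and any normalization of the counts may differ. `triad_counts` is a hypothetical helper that returns raw counts.

```python
from itertools import product

# ASSUMPTION: a commonly used seven-class reduced alphabet.
GROUPS = {
    "1": "AGV", "2": "ILFP", "3": "YMTS", "4": "HNQW",
    "5": "RK", "6": "DE", "7": "C",
}
RESIDUE_TO_CLASS = {aa: label for label, members in GROUPS.items() for aa in members}


def triad_counts(sequence: str) -> dict:
    """Count each of the 7 x 7 x 7 = 343 class triads in the sequence."""
    # A non-canonical residue raises KeyError in this sketch.
    classes = "".join(RESIDUE_TO_CLASS[aa] for aa in sequence.upper())
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    for i in range(len(classes) - 2):
        counts[classes[i:i + 3]] += 1
    return counts


triads = triad_counts("AGVC")  # encoded as "1117": triads "111" and "117"
```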
CTD
- protpy.ctd.str_to_num(sequence: str, property: dict[str, str]) str[source]
Convert a protein sequence string into its numerical class encoding using the input physicochemical property.
:param : protein sequence. :type : sequence: str :param : physicochemical property name to use when calculating descriptor. :type : property: str
- Returns:
converted protein sequence into numerical format.
- Return type:
sequence_converted: str
- protpy.ctd.ctd_composition(sequence: str, property: str = 'hydrophobicity') DataFrame[source]
Calculate composition physicochemical/structural descriptor. Composition is determined as the number of amino acids of a particular property divided by the total number of amino acids. The shape of the output will be 1 x 3, with 3 features being generated per sequence.
:param : protein sequence. :type : sequence: str :param : physicochemical property name to use when calculating descriptor. :type : property: str (default="hydrophobicity")
- Returns:
dataframe of calculated composition values for sequence using selected physicochemical property. Output will be of shape 1 x 3, with 3 features being generated per sequence.
- Return type:
ctd_composition_df: pd.DataFrame
- protpy.ctd.ctd_transition(sequence: str, property: str = 'hydrophobicity') DataFrame[source]
Calculate transition physicochemical/structural descriptor. Transition is determined as the number of transitions from a particular property to different property divided by (total number of amino acids − 1). The shape of the output will be 1 x 3, with 3 features being generated per sequence.
:param : protein sequence. :type : sequence: str :param : physicochemical property name to use when calculating descriptor. :type : property: str (default="hydrophobicity")
- Returns:
dataframe of calculated transition values for sequence using selected physicochemical property. Output will be of shape 1 x 3, with 3 features being generated per sequence.
- Return type:
ctd_transition_df: pd.DataFrame
- protpy.ctd.ctd_distribution(sequence: str, property: str = 'hydrophobicity') DataFrame[source]
Calculate distribution physicochemical/structural descriptor. Distribution is the chain length within which the first, 25%, 50%, 75% and 100% of the amino acids of a particular property are located. The shape of the output will be 1 x 15, with 15 features being generated per sequence.
:param : protein sequence. :type : sequence: str :param : physicochemical property name to use when calculating descriptor. :type : property: str (default="hydrophobicity")
- Returns:
dataframe of calculated distribution values for sequence using selected physicochemical property. Output will be of shape 1 x 15, with 15 features being generated per sequence.
- Return type:
ctd_distribution_df: pd.DataFrame
- protpy.ctd.ctd_(sequence: str, property: str = 'hydrophobicity', all_ctd: bool = True) DataFrame[source]
Calculate all Composition, Transition and Distribution (CTD) features of a protein sequence. Composition is the number of amino acids of a particular property (e.g., hydrophobicity) divided by the total number of amino acids in a protein sequence. Transition characterizes the percent frequency with which amino acids of a particular property are followed by amino acids of a different property. Distribution measures the chain length within which the first, 25%, 50%, 75%, and 100% of the amino acids of a particular property are located, respectively [1, 2].
CTD properties available are: Polarizability, Solvent Accessibility, Secondary Structure, Charge, Polarity, Normalized VDWV, Hydrophobicity. Each property generates an output of shape 1 x 21, of which 3 features are Composition, 3 are Transition and 15 are Distribution. When calculating all available features, the generated output will be of shape 1 x 147, of which 21 features are Composition, 21 are Transition and the remaining 105 are Distribution.
:param : protein sequence. :type : sequence: str :param : physicochemical property name to use when calculating descriptor. :type : property: str (default="hydrophobicity") :param : calculate all CTD descriptors and concatenate together. :type : all_ctd: bool (default=True)
- Returns:
dataframe of CTD descriptor values for all protein sequences. DataFrame will be of the shape 1 x 147, where 147 is the total number of features calculated from the CTD descriptors per sequence, with each property generating an output of 1 x 21.
- Return type:
ctd_df: pd.DataFrame
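To make the 3 + 3 + 15 = 21 feature breakdown concrete, here is a sketch of the three CTD parts for a single property. The three-class hydrophobicity grouping and the rank convention used for the distribution percentiles are assumptions made for illustration; the library's class table and rounding may differ. `ctd_sketch` is a hypothetical helper.

```python
import math

# ASSUMPTION: a commonly used polar / neutral / hydrophobic grouping.
HYDROPHOBICITY_CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
TO_CLASS = {aa: c for c, members in HYDROPHOBICITY_CLASSES.items() for aa in members}


def ctd_sketch(sequence: str):
    """Return (composition, transition, distribution) lists of length 3, 3, 15."""
    enc = "".join(TO_CLASS[aa] for aa in sequence.upper())
    n = len(enc)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    # Composition: fraction of residues in each class.
    composition = [enc.count(c) / n for c in "123"]
    # Transition: class changes between neighbours, in either direction.
    transition = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        changes = sum(1 for x, y in zip(enc, enc[1:]) if {x, y} == {a, b})
        transition.append(changes / (n - 1))
    # Distribution: position (% of length) of the first, 25%, 50%, 75%
    # and 100% occurrence of each class.
    distribution = []
    for c in "123":
        positions = [i + 1 for i, x in enumerate(enc) if x == c]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            rank = max(1, math.ceil(frac * len(positions)))
            distribution.append(100.0 * positions[rank - 1] / n)
    return composition, transition, distribution


comp, trans, dist = ctd_sketch("RGC")  # encoded as "123"
```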
Sequence Order
- protpy.sequence_order.sequence_order_coupling_number_(sequence: str, d: int = 1, distance_matrix: str = 'schneider-wrede') float[source]
Calculate Sequence Order Coupling Number (SOCN) features for input protein sequence. SOCN computes the dissimilarity between amino acid pairs. The distance between amino acid pairs is determined by d which varies between 1 to lag. For each d, it computes the sum of the dissimilarities of all amino acid pairs. The output will be a single float value representing the SOCN [1].
This function should not be confused with sequence_order_coupling_number() below, which calculates multiple SOCN descriptor values for the input sequence according to the value of lag; with the default of lag=30, 30 SOCN values will be calculated.
- Parameters:
sequence (str): protein sequence.
d (int, default=1): gap between two amino acids.
distance_matrix (str, default="schneider-wrede"): physicochemical distance matrix to use; accepts "schneider-wrede" or "grantham".
- Returns:
calculated sequence order coupling number value from sequence.
- Return type:
socn: float
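As a rough illustration of the single-value calculation, the sketch below sums the squared distances of all residue pairs that are d positions apart. Squaring each distance follows the commonly cited form of the descriptor and is an assumption here, as are the names socn_single, toy_distance and TOY_SCALE. The toy distance is a stand-in and is not the Schneider-Wrede or Grantham matrix.

```python
# Hypothetical one-dimensional "property" values for a 4-letter alphabet.
# NOT the Schneider-Wrede or Grantham matrices, which are 20 x 20 tables.
TOY_SCALE = {"A": 0.0, "G": 0.5, "K": 2.0, "W": 3.0}


def toy_distance(a: str, b: str) -> float:
    """Toy dissimilarity between two residues (illustration only)."""
    return abs(TOY_SCALE[a] - TOY_SCALE[b])


def socn_single(sequence: str, d: int = 1) -> float:
    """Sum squared distances over all residue pairs that are d positions apart."""
    if not 1 <= d < len(sequence):
        raise ValueError("d must satisfy 1 <= d < len(sequence)")
    return sum(
        toy_distance(sequence[i], sequence[i + d]) ** 2
        for i in range(len(sequence) - d)
    )


# Pairs at d=1 in "AGKW": (A,G), (G,K), (K,W) -> 0.25 + 2.25 + 1.0
print(socn_single("AGKW", d=1))  # 3.5
# Pairs at d=2 in "AGKW": (A,K), (G,W) -> 4.0 + 6.25
print(socn_single("AGKW", d=2))  # 10.25
```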
- protpy.sequence_order.sequence_order_coupling_number(sequence: str, lag: int = 30, distance_matrix: str = 'schneider-wrede') -> DataFrame
Calculate Sequence Order Coupling Number (SOCN) features for the input protein sequence. SOCN computes the dissimilarity between amino acid pairs. The distance between the two amino acids in a pair is given by d, which varies from 1 to lag. For each d, the sum of the dissimilarities of all amino acid pairs is computed. The number of output features is N, where N = lag; with the default lag of 30 the output is 1 x 30.
- Parameters:
sequence (str): protein sequence.
lag (int, default=30): maximum gap between 2 amino acids; the protein length should be larger than lag.
distance_matrix (str, default="schneider-wrede"): physicochemical distance matrix to use; accepts "schneider-wrede" or "grantham".
- Returns:
Dataframe of SOCN descriptor values for all protein sequences. Output will be of the shape 1 x N, where N is the number of features calculated from the descriptor and N=lag.
- Return type:
sequence_order_df: pd.DataFrame
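The relationship between lag and the number of output features can be sketched by repeating the pairwise sum for every gap d from 1 to lag. The function name socn_profile, the SOCN_d column labels and the 0/1 toy distance are hypothetical; the squared-distance form is assumed, and a plain dict stands in for the one-row DataFrame.

```python
from typing import Callable, Dict


def socn_profile(
    sequence: str, lag: int, distance: Callable[[str, str], float]
) -> Dict[str, float]:
    """Return one SOCN-style value per gap d = 1..lag (a 1 x lag row)."""
    if lag >= len(sequence):
        raise ValueError("the sequence must be longer than lag")
    profile = {}
    for d in range(1, lag + 1):
        # Sum of squared distances over all pairs that are d positions apart.
        profile[f"SOCN_{d}"] = sum(
            distance(sequence[i], sequence[i + d]) ** 2
            for i in range(len(sequence) - d)
        )
    return profile


def toy_distance(a: str, b: str) -> float:
    """Toy distance for demonstration only: 0 if identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


row = socn_profile("MKVLAAGIWRDE", lag=5, distance=toy_distance)
print(len(row))  # 5 features, one per gap
```

The length check mirrors the requirement above that the protein length be larger than lag; without it, the larger gaps would have no residue pairs to sum over.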
- protpy.sequence_order.sequence_order_coupling_number_all(sequence: str, lag: int = 30) -> DataFrame
Calculate Sequence Order Coupling Number (SOCN) descriptor values of the input protein sequence using both distance matrices (schneider-wrede and grantham) [3]. The distance between the two amino acids in a pair is given by d, which varies from 1 to lag. For each d, the sum of the dissimilarities of all amino acid pairs is computed. Each matrix generates an output of 1 x N, where N is the lag, so concatenating the output of the two matrices gives 1 x (N * 2). With the default lag of 30, the output is 1 x 60.
- Parameters:
sequence (str): protein sequence.
lag (int, default=30): maximum gap between 2 amino acids; the protein length should be larger than lag.
- Returns:
Concatenated dataframe of SOCN descriptor values for the input protein sequence using both distance matrices. The output is of shape 1 x (N * 2), where N = lag; each matrix contributes 1 x N, so the default lag of 30 gives 1 x 60.
- Return type:
socn_all: pd.DataFrame
- protpy.sequence_order.quasi_sequence_order(sequence: str, lag: int = 30, weight: float = 0.1, distance_matrix: str = 'schneider-wrede') -> DataFrame
Calculate Quasi Sequence Order (QSO) features for the protein sequence. The quasi-sequence-order descriptors were proposed by K.C. Chou et al. [1] and are derived from the distance matrix between the 20 amino acids. By default, the Schneider-Wrede physicochemical distance matrix is used; the Grantham chemical distance matrix is also available. Both of these matrices are used by Grantham et al. in the calculation of the descriptor [3, 4]. N + 20 values are calculated per sequence, where N is the lag; with the default lag of 30 the output is 1 x 50. A weighting factor (weight) is also applied in the calculation.
- Parameters:
sequence (str): protein sequence.
lag (int, default=30): maximum gap between 2 amino acids; the protein length should be larger than lag.
weight (float, default=0.1): weighting factor.
distance_matrix (str, default="schneider-wrede"): physicochemical distance matrix to use; accepts "schneider-wrede" or "grantham".
- Returns:
dataframe of quasi-sequence-order descriptor values for the protein sequences, with output shape 1 x (N + 20), where N is the lag. With a default lag of 30 the output will be 1 x 50 per sequence.
- Return type:
quasi_sequence_order_df: pd.DataFrame
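To show where the N + 20 values come from, the sketch below follows the commonly cited formulation of the descriptor: 20 values are amino acid frequencies and N values are weighted coupling numbers, all divided by a shared denominator. The exact normalisation used by protpy is not stated here, so treat this as an assumption; qso_sketch, the feature names and the 0/1 toy distance are hypothetical.

```python
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def qso_sketch(
    sequence: str,
    lag: int,
    weight: float,
    distance: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return lag + 20 quasi-sequence-order style values for one sequence."""
    n = len(sequence)
    if lag >= n:
        raise ValueError("the sequence must be longer than lag")

    # Normalised occurrence of each amino acid type (20 values).
    freqs = {aa: sequence.count(aa) / n for aa in AMINO_ACIDS}

    # Coupling number for each gap d = 1..lag (sum of squared distances).
    taus = [
        sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))
        for d in range(1, lag + 1)
    ]

    # Shared denominator: frequency total plus weighted coupling total.
    denom = sum(freqs.values()) + weight * sum(taus)

    features = {f"QSO_{aa}": f / denom for aa, f in freqs.items()}
    for d, tau in enumerate(taus, start=1):
        features[f"QSO_lag{d}"] = weight * tau / denom
    return features


def toy_distance(a: str, b: str) -> float:
    """Toy distance for demonstration only: 0 if identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


values = qso_sketch("MKVLAAGIWRDE", lag=3, weight=0.1, distance=toy_distance)
print(len(values))  # 23 values: 20 frequency terms + 3 lag terms
```

Under this formulation the weight sets how much the sequence-order terms contribute relative to the composition terms, and all lag + 20 values sum to 1.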
- protpy.sequence_order.quasi_sequence_order_all(sequence: str, lag: int = 30, weight: float = 0.1) -> DataFrame
Calculate Quasi Sequence Order (QSO) features for the input protein sequence using both physicochemical distance matrices and concatenate them into one output dataframe. The output is of shape 1 x ((N + 20) * 2), where N + 20 is the quasi-sequence-order output from one matrix and N is the lag; it is multiplied by two because the outputs of the 2 matrices are concatenated. With the default lag of 30 the output is 1 x 100. A weighting factor (weight) is also applied in the calculation.
- Parameters:
sequence (str): protein sequence.
lag (int, default=30): maximum gap between 2 amino acids; the protein length should be larger than lag.
weight (float, default=0.1): weighting factor.
- Returns:
dataframe of quasi-sequence-order descriptor values for the protein sequence, with output shape 1 x ((N + 20) * 2), where N + 20 is the quasi-sequence-order output from one matrix and N is the lag. The output is multiplied by two because the outputs of the 2 matrices are concatenated.
- Return type:
quasi_sequence_order_all_df: pd.DataFrame
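The output widths quoted for the sequence order descriptors follow directly from lag and the number of distance matrices. The helper functions below are hypothetical (they are not part of protpy) and only restate that arithmetic, which can be handy when sizing a feature matrix in advance.

```python
N_AMINO_ACIDS = 20  # canonical amino acids
N_MATRICES = 2      # schneider-wrede and grantham


def socn_width(lag: int = 30) -> int:
    """Number of SOCN features for one distance matrix."""
    return lag


def socn_all_width(lag: int = 30) -> int:
    """Number of SOCN features when both matrices are concatenated."""
    return lag * N_MATRICES


def qso_width(lag: int = 30) -> int:
    """Number of QSO features for one distance matrix."""
    return lag + N_AMINO_ACIDS


def qso_all_width(lag: int = 30) -> int:
    """Number of QSO features when both matrices are concatenated."""
    return (lag + N_AMINO_ACIDS) * N_MATRICES


# With the default lag of 30:
print(socn_width(), socn_all_width(), qso_width(), qso_all_width())  # 30 60 50 100
```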
Note
A demo of the software and API is available here.