Abstract

In recent years, peptides have gained significant relevance due to their therapeutic properties. The surge in peptide production and synthesis has generated vast amounts of data, enabling the creation of comprehensive databases and information repositories. Advances in sequencing techniques and artificial intelligence have further accelerated the design of tailor-made peptides. However, leveraging these techniques requires versatile and continuously updated storage systems, along with tools that facilitate peptide research and the implementation of machine learning for predictive systems. This work introduces Peptipedia v2.0, one of the most comprehensive public repositories of peptides, supporting biotechnological research by simplifying peptide study and annotation. Peptipedia v2.0 has expanded its collection by over 45% with peptide sequences that have reported biological activities. The functional biological activity tree has been revised and enhanced, incorporating new categories such as cosmetic and dermatological activities, molecular binding, and anti-ageing properties. Using protein language models and machine learning, more than 90 binary classification models have been trained, validated, and incorporated into Peptipedia v2.0. These models exhibit average sensitivities and specificities of 0.877 ± 0.053 and 0.873 ± 0.054, respectively, facilitating the annotation of more than 3.6 million peptide sequences with unknown biological activities, which are also registered in Peptipedia v2.0. Additionally, Peptipedia v2.0 introduces description tools based on structural and ontological properties and user-friendly machine learning tools to facilitate the application of machine learning strategies to the study of peptide sequences.

Database URL: https://peptipedia.cl/

Introduction

Peptides are versatile biomolecules that can be synthetic or found in natural sources and are attractive candidates for therapeutic applications [1–3]. Peptides play crucial roles in numerous biological processes. Their functions are diverse, serving as structural components, enzymatic inhibitors, cell-penetrating agents, hormones, host defence molecules, and neurotransmitters [3]. Additionally, peptides act as cell surface receptors [4] and are integral to drug delivery applications [5].

Peptide drugs present several potential advantages over traditional small-molecule drugs, such as increased selectivity, affinity, efficacy, and safety, along with reduced toxicity and immunogenicity [6]. However, their widespread clinical application is hindered by challenges, including a short half-life, limited oral bioavailability, and susceptibility to plasma degradation [7].

The therapeutic peptide market began about 100 years ago with the use of insulin to treat type 1 diabetes [8, 9]. To date, over 100 peptide drugs are available in the global market for various diseases, including human immunodeficiency virus (HIV) infection, chronic pain, metabolic disorders, and infectious diseases [3, 10, 11]. The global therapeutic peptide market reached a value of US$42.05 billion in 2022 [12]. Owing to research efforts, technological development, and investment by pharmaceutical companies, peptide drug discovery and development is expected to continue expanding in the coming years, with the market forecast to reach US$68.6 billion by 2030 (a compound annual growth rate of 6.3% from 2023 to 2030) [1, 12–14].

The lack of structured and specialized peptide repositories drove the development of Peptipedia, an integrated peptide database, and its user-friendly web platform [15]. Peptipedia v1.0 aggregated over 92 000 peptide sequences and more than 75 000 peptides with documented biological activities from 30 diverse data sources. This centralized repository not only offered a comprehensive collection of peptide data but also included valuable tools such as a physicochemical properties estimator, statistical characterization of amino acid sequences, and bioinformatics resources like sequence alignment to enhance peptide research. Moreover, Peptipedia v1.0 incorporated machine learning-based binary classification models to predict potential biological activities, including antimicrobial, antiviral, and antibacterial properties. Integrating these features in a single platform significantly advanced the accessibility and analysis of peptide-related information, underscoring the importance of an integrated peptide database for the scientific community.

Before Peptipedia v1.0, peptide information was scattered across diverse databases such as the Database of Antimicrobial Activity and Structure of Peptides, Erop-Moscow, the Linking Antimicrobial Peptides database (LAMP), the Data Repository of Antimicrobial Peptides (DRAMP), and the Structurally annotated therapeutic peptides database (SATPdb) [16–20]. Peptide-related databases range from highly specialized resources, e.g. for blood-brain barrier peptides [21], quorum sensing peptides [22], anti-angiogenic peptide prediction [23], and bacteriocin peptides [24], to general databases, e.g. the Universal Protein Resource (UniProt) [25]; LAMP2, a generic antimicrobial peptide database [18]; DRAMP, a data repository of antimicrobial peptides [19]; and SATPdb, a database of structurally annotated therapeutic peptides [20].

Despite the availability of numerous databases, integrating information between them remains challenging. Peptides are usually classified by just one functional biological activity. However, moonlight peptides have two or more known activities within the same domain [26], and identifying the potential multiple activities of a peptide is relevant for biotechnological and pharmaceutical industries [27, 28].

Peptipedia is a pivotal tool for peptide research, designed to extract, consolidate, organize, and curate information from multiple databases, significantly enhancing knowledge in the field. While Peptipedia v1.0 served as a powerful resource, its development revealed several areas for improvement. Firstly, the database could benefit from a substantial increase in the number of records and a broader scope of information related to each peptide. Secondly, the classification of biological activities in v1.0 was limited, often relegating many peptide activities to a generic ‘other’ category, reducing the data’s specificity and utility. Lastly, the lack of an automated system for updating newly generated data limited the platform’s ability to stay current and fully realize its potential.

The experience gained from Peptipedia v1.0 was instrumental in shaping the second version of the database. By addressing these identified challenges, Peptipedia v2.0 offers a more comprehensive and dynamic knowledge platform. It provides enhanced data integration, expanded records, a more refined classification system for biological activities, and an automatic updating mechanism. These advancements ensure that Peptipedia continues to serve as a cutting-edge resource for researchers, facilitating deeper insights and fostering innovation in peptide research.

This work presents a significant update to the Peptipedia database, incorporating over 3.8 million peptide sequences from more than 70 data sources. Over 100 000 peptides are documented as having functional biological activity based on the literature. The categories and subcategories of the biological activity tree were meticulously examined and reorganized, with new activities such as anti-ageing, cytokine, and molecular binding included. An automatic update system was also implemented, and the advanced search and downloading functionalities were enhanced.

In Peptipedia v2.0, various types of information have been linked to each peptide sequence, including functional domains [29], gene ontology annotations [30, 31], secondary structure information [32], and 3D structures available through Protein Data Bank (PDB) [33] and AlphaFold [21, 34] databases.

Moreover, Peptipedia v2.0 incorporates data on physicochemical property descriptors, patents, and previous publications. Bioinformatic enrichment analysis and statistical evaluation tools have also been integrated. Furthermore, the functional classification models have been updated, enhancing the precision of individual models and improving their generalization. These trained classification models evaluated over 3.7 million peptide sequences, predicting their functional biological activities.

Finally, Peptipedia v2.0 includes machine learning tools for users to analyse their datasets and build their models. These tools encompass protein language models and classical amino acid coding methods for numerical representation strategies, predictive model training through traditional supervised learning algorithms, pattern recognition using unsupervised learning approaches, and sequence similarity networks combined with community detection techniques for pattern recognition. With these enhancements, users can gain deeper insights and uncover novel characteristics from their datasets. Consequently, all these new tools and updates establish Peptipedia v2.0 as one of the most comprehensive, thoroughly processed, and tractable repositories of peptides currently available.

Methodologies and implementation strategies

Collecting and processing peptide sequences

This work searched databases, datasets, and public repositories related to the study of peptides to update and increase the number of records in the Peptipedia v2.0 database. Different keywords were used to collect the data sources in Google Scholar, like ‘peptides’, ‘AMPs’, ‘neuropeptides’, ‘anticancer peptides’, ‘nutraceutical peptides’, and ‘signal peptides’. Besides, generic protein databases such as UniProtKB [25] and RCSB PDB [35] were incorporated (see section S1 of Supplementary data for more details).

Information related to the characteristics of the reported peptide sequences, including their biological activities, descriptions, experimental information, and related publications or patents, was downloaded from all data sources.

Then, Python scripts were implemented to process the raw data downloaded from the data sources and transform the information for loading into the Peptipedia v2.0.

A length filter was applied, retaining only peptides longer than three residues and no longer than 150 residues. The collected peptides were also classified as canonical (containing only the 20 natural amino acids) or non-canonical. Then, a semantic analysis of the descriptions available in the data sources was performed to recognize the biological activity of each peptide sequence.
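The length filter and the canonical/non-canonical classification described above can be illustrated with a minimal Python sketch. The function names and the record layout are hypothetical and are not taken from the Peptipedia code base; the sketch also assumes that sequences are given as one-letter amino acid strings.

```python
# Hypothetical helper names; only the thresholds come from the text above.
CANONICAL_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")  # the 20 natural amino acids
MIN_LENGTH = 4    # "longer than three residues"
MAX_LENGTH = 150  # "no longer than 150 residues"


def passes_length_filter(sequence: str) -> bool:
    """Return True when the peptide length lies within the accepted range."""
    return MIN_LENGTH <= len(sequence) <= MAX_LENGTH


def is_canonical(sequence: str) -> bool:
    """Return True when every residue is one of the 20 natural amino acids."""
    return set(sequence.upper()) <= CANONICAL_RESIDUES


def preprocess(sequences):
    """Drop sequences outside the length range and flag the rest as canonical or not."""
    records = []
    for raw in sequences:
        sequence = raw.strip().upper()
        if not passes_length_filter(sequence):
            continue
        records.append({"sequence": sequence, "canonical": is_canonical(sequence)})
    return records
```

In this sketch a sequence containing an ambiguous or modified residue code (for example `X`) is kept but flagged as non-canonical, which mirrors the two-class split described in the text.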

Finally, a loader Python script was implemented to load the records into Peptipedia v2.0, yielding a scalable Extract, Transform, and Load (ETL) strategy for each data source used.

Peptide descriptions and enrichment analysis implementation

Each peptide incorporated in the Peptipedia v2.0 was characterized using the peptide descriptor service and enrichment analysis system.

First, the modlAMP tool [36] was used to calculate the physicochemical properties of canonical peptide sequences. Then, the standalone version of RaptorX-Property tool [32] was employed to predict the secondary structure of all records available in Peptipedia v2.0.

Through homology mechanisms, enrichment analysis was performed for all canonical peptides using the MetaStudent tool [37]. MetaStudent allows the assignment of gene ontology terms from different sources, such as molecular function, cellular localization, and biological process.

Once the gene ontology term prediction was completed, the results were filtered by the probability score returned by the tool, and only terms with a probability higher than 0.5 were incorporated as descriptors of the analysed peptide sequence. Finally, the PfamScan tool [38] was used to estimate protein domain families for all peptides registered in Peptipedia v2.0.
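The probability filter applied to the predicted gene ontology terms can be sketched as follows. The input layout, a list of (term, probability) pairs, is an assumption made for illustration and does not reproduce the actual MetaStudent output format; the function name is likewise hypothetical.

```python
GO_PROBABILITY_THRESHOLD = 0.5  # threshold stated in the text


def filter_go_terms(predictions, threshold=GO_PROBABILITY_THRESHOLD):
    """Keep only (term, probability) pairs whose probability is strictly above the threshold."""
    return [(term, probability) for term, probability in predictions if probability > threshold]


# Illustrative predictions for a single peptide (values are made up).
example_predictions = [
    ("GO:0005576", 0.91),
    ("GO:0042742", 0.50),
    ("GO:0006952", 0.32),
]
kept_terms = filter_go_terms(example_predictions)
```

Because the text specifies a probability "higher than" 0.5, the comparison is strict and a term scored at exactly 0.5 is discarded.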

Training predictive models

This work implements binary classification models to identify peptide sequences’ functional biological activities by combining embedding representation through pretrained models with classic supervised learning algorithms [39, 40].

A binary dataset was constructed for each biological activity identified in Peptipedia v2.0. The following steps were used to build the datasets: (i) peptides exhibiting the target activity were collected as positive examples; (ii) peptides without the target activity were collected as negative examples, including experimentally validated negative examples wherever this information was available. This was done in particular for peptides with antimicrobial, antiviral, or antibacterial activity. In addition, peptides used as negative examples in previously reported works were reused as negative examples for different activities [41–44]; (iii) the CD-Hit tool [45] was applied to remove redundancy in each category using a homology threshold of 90% and default values for the remaining configuration parameters [40, 46], and the representative sequences were used to rebuild the binary classification dataset; (iv) an undersampling strategy was applied to balance the dataset by randomly removing negative examples; and (v) different pretrained models were then used to generate embeddings, representing peptide sequences numerically for training the classification models [47] (see more details in Section S2 of Supplementary data).
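Step (iv), the random undersampling of negative examples, can be illustrated with a short Python sketch. It assumes that redundancy removal has already been performed and that negatives are the majority class; the function name, the label encoding (1 for positive, 0 for negative), and the fixed seed are choices made for illustration only.

```python
import random


def undersample_negatives(positives, negatives, seed=0):
    """Balance a binary dataset by randomly discarding surplus negative examples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    if len(negatives) > len(positives):
        negatives = rng.sample(negatives, len(positives))
    dataset = [(sequence, 1) for sequence in positives]
    dataset += [(sequence, 0) for sequence in negatives]
    rng.shuffle(dataset)
    return dataset
```

When the negative class is not larger than the positive class, the sketch leaves both classes untouched, since the text only describes the removal of negative examples.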

Activities with fewer than 50 peptides were excluded because they are classified as Low-N datasets, and more specialized strategies, such as transfer learning, semi-supervised approaches, and contrastive learning methods, are required to train predictive models using these types of datasets [48].

Each dataset was divided into training and validation sets using an 80:20 proportion. Nine supervised learning algorithms, including Decision Tree, Random Forest, and XGBoost, were used to train classification models on the training set. K-fold cross-validation (k = 10) was performed to reduce the risk of overfitting. The validation set was then used to assess model performance using classical metrics such as precision, recall, accuracy, and F-score [49]. The whole process, comprising the split into training and validation sets, training under k-fold cross-validation, and evaluation with the metrics summarized in Section S6 of Supplementary data, was repeated n = 30 times to demonstrate the generalization capacity of the trained strategies, as proposed in our previous work [40].
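The evaluation protocol described above can be sketched with the Python standard library alone. This is an illustration and not the DMAKit implementation: the function names are hypothetical, the split is not stratified (the text does not state whether stratification was used), and the classifier itself is omitted.

```python
import random


def train_validation_split(n_samples, validation_fraction=0.2, seed=0):
    """Shuffle sample indices and return (training, validation) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_validation = round(n_samples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


def k_fold_indices(indices, k=10):
    """Yield (fit, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, held_out


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F-score for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }
```

Repeating the procedure n = 30 times then amounts to calling `train_validation_split` with 30 different seeds and collecting the metrics of each run, so that their mean and standard deviation can be reported.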

The combination of supervised learning algorithm and embedding representation was selected using the performances obtained during the training and validation process, applying the following criteria: (i) highest performance during the training stage, (ii) highest performance during the validation stage, and (iii) lowest difference between training and validation performance, to reduce overfitting. Four classic metrics were employed to evaluate the trained models during the training and validation process: accuracy, precision, recall, and F-score (see more details in Section S6 of Supplementary data).
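The text does not specify how the three selection criteria are combined, so the following Python sketch shows only one possible reading: candidates are ranked by the mean of their training and validation performance, penalized by the absolute gap between the two. The scoring rule, the function names, the embedding labels, and the performance values are all assumptions introduced for illustration.

```python
def selection_score(train_score, validation_score):
    """Reward high training and validation performance and penalize the gap between them."""
    gap = abs(train_score - validation_score)
    return (train_score + validation_score) / 2 - gap


def select_best(candidates):
    """Return the candidate (algorithm and embedding pair) with the highest selection score."""
    return max(candidates, key=lambda c: selection_score(c["train"], c["validation"]))


# Made-up performances for three hypothetical algorithm/embedding combinations.
candidates = [
    {"algorithm": "Random Forest", "embedding": "embedding_A", "train": 0.99, "validation": 0.85},
    {"algorithm": "XGBoost", "embedding": "embedding_B", "train": 0.90, "validation": 0.88},
    {"algorithm": "Decision Tree", "embedding": "embedding_A", "train": 0.80, "validation": 0.79},
]
best = select_best(candidates)
```

With these values the second candidate is preferred: the first reaches a higher training performance, but its large gap between training and validation suggests overfitting.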

Finally, the models were exported in joblib format for integration into the Peptipedia v2.0.

Implementation strategies and availability system

The Peptipedia v2.0 database was designed using a relational schema and is managed with PostgreSQL. All database queries are handled through an application programming interface (API) implemented with the Flask framework version 2.3 and SQLAlchemy version 2.

The ETL system, the description process, the enrichment analysis strategies, and every tool and service available in Peptipedia v2.0 were implemented using Python version 3.9.

All binary classification models were implemented using Python v3.9.17 and the supervised learning module available on the DMAKit Python library [50]. Moreover, the embedding representation for the peptide sequences was performed using the bioembedding library [47].

Finally, the front end of the user-friendly web platform in Peptipedia v2.0 was implemented using the React framework. The system was deployed using Podman on a public server running AlmaLinux 9 as the operating system. The server has eight vCPU cores, 64 GB of RAM, 500 GB of NVMe storage, and 32 TB of traffic.

Results and discussion

Peptipedia v2.0 is a user-friendly web platform designed to facilitate the study of peptide sequences using advanced bioinformatics tools and machine learning strategies. This updated version provides a database of over 100 000 peptides with reported biological activities, drawn from more than 70 data sources. Each peptide is characterized by various physicochemical and thermodynamic properties.

New features in Peptipedia v2 include enrichment analysis through Gene Ontology terms, secondary structure predictions, and functional domain evaluations, providing more comprehensive information on each peptide sequence. The functional biological activity tree has been enhanced to include more specific cosmetics, dermatology, taste, and molecular binding activities.

Additionally, the platform now implements over 90 binary classification models for biological activity classification, utilizing embedding representations from pretrained models and traditional supervised learning algorithms. Numerous tools, such as advanced search capabilities, peptide sequence downloading, physicochemical characterization, and functional biological activity classification, have been updated or newly incorporated.

Moreover, Peptipedia v2.0 includes an integrative machine learning pipeline to streamline the development of sequence-based predictive models and pattern recognition. An overview of Peptipedia v2.0 is presented in Figure 1, highlighting the most relevant data sources, peptide sequence information, bioinformatics tools, and machine learning applications.

Figure 1. Overview of Peptipedia v2.0. This new version of Peptipedia includes >100 000 peptides registered with functional biological activities, extracted from >70 data sources. Different physicochemical properties and enrichment analyses were addressed to characterize the peptides collected. The services and functionalities were updated, incorporating gene ontology and functional domain predictions, secondary structure evaluation, and physicochemical properties estimation. Besides, >90 functional biological activity classification models were implemented, combining embedding representation through pretrained models and supervised learning algorithms. Finally, customizable machine learning pipelines can be implemented by employing the machine learning tools in Peptipedia v2.0 to facilitate the application of machine learning techniques to study peptide sequences.


New data sources, biological activity tree, and peptide sequences

Peptipedia v2.0 includes 76 data sources, comprising databases, repositories, and datasets (see Section S1 of Supplementary data for more details). These data sources are dedicated to collecting and retrieving peptide sequences alongside their corresponding biological activities. The number of data sources increased by over 100% compared with the 30 sources of the initial version of Peptipedia [15]. Alongside the expansion of data sources, the biological activity tree was comprehensively updated. Figure 2 summarizes the updated functional biological activity tree in Peptipedia v2.0.

Figure 2. Updated functional biological activity tree implemented in Peptipedia v2.0. In this work, we have updated the functional biological activity tree to include new activities related to taste, molecular binding, and cosmetics and dermatology. Additionally, various antiviral activities have been added, specifically targeting virus families and different mechanisms, with a particular focus on HIV. We have also incorporated peptides associated with therapeutic functions, such as inhibitors, antitoxins, antiallergens, and spermicides. Peptides with toxic effects have also been included, constituting a significant part of the newly reported peptides in Peptipedia v2.0. Lastly, the category of peptides classified as ‘other’ has been updated to encompass diverse biological activities unrelated to the main proposed activities, such as participation in photosynthesis and antibarnacle properties.


In this new version of the biological activity tree, the activities are organized into 11 primary biological activities at the first level of the tree, plus an additional activity called ‘other’. While maintaining activities associated with therapeutic and immunological effects, this version introduces additional biological activities such as peptides inducing toxic effects, taste peptides, and those with cosmetic and dermatological activities (see Table 1 and Section S3 of Supplementary data for more details).

Table 1. Summary of activities classified in the first level of the updated functional biological activity tree

Biological activity | Definition | Records | Data sources
Therapeutic | Therapeutic peptides are molecules, whether designed or natural, with medical applications that aim to treat or alleviate diseases by exerting various bodily functions. | 57 750 | 51
Cell–cell communication | Peptides involved in cell-to-cell communication, which can influence cell signalling and cell-mediated immune responses. | 3343 | 6
Drug delivery vehicle | These peptides function as carriers that transport and deliver drugs in a specific manner to target cells or tissues. They can target specific receptors on cells to release their therapeutic cargo. | 2858 | 11
Taste | A short chain of amino acids that elicits a specific taste sensation when it interacts with taste receptors on the tongue. Taste peptides can evoke a range of tastes, including sweet, bitter, salty, sour, and umami. | 192 | 2
Signal peptide | Peptides that direct proteins to specific cellular locations used in protein secretion and intracellular transport processes, which may affect cellular communication and protein-mediated immune responses. | 11 835 | 9
Propeptide | Inactive peptides that are converted to active peptides by processing, act as a precursor to a protein, and are removed during processing, which can influence biological activity and protein regulation. | 1424 | 2
Toxic | Peptides that possess harmful properties and can induce toxic effects in biological systems. These peptides may interfere with normal cellular functions, alter metabolic processes, or cause damage to cells and tissues. | 21 906 | 20
Cosmetic and dermatology | Small-chain amino acid peptides are used in skin care and cosmetic products. These peptides are designed to address various skin-related concerns such as wrinkles, fine lines, loss of elasticity, and other signs of skin ageing. | 333 | 5
Molecular binding | Peptides that bind specifically to molecules, such as proteins, RNA, or DNA, which may influence the regulation of biological processes and immune responses. | 314 | 3
Immunological | Peptides have properties that influence the body’s immune response and interact with components of the immune system. They can modulate the function of immune cells and play a role in regulating and modulating the immune response. | 4258 | 11
Neurological | Essential peptides in nervous system signalling, influencing cellular communication and functions such as memory and neuronal regeneration. | 10 886 | 5
Other | Peptides with other activities non-generalizable with activities in the first level of the tree, including activities like reductant, pore-forming, or ribonucleoprotein. | 2399 | 10

In Peptipedia v2.0, only 2399 peptides were categorized as ‘other’. These peptides exhibit biological activity but cannot be classified within the biological activities described in the implemented tree. Additionally, only 1034 peptides—approximately 1% of those registered in this updated version—lack reported activities. This highlights substantial improvements in peptide description and functional understanding compared to the first version reported in Peptipedia [15].

The updated biological activity tree aims to simplify peptide classification by minimizing ambiguity and facilitating more explicit sequence reporting. Integrating more data sources has resulted in a significant increase in sequences with reported biological activities. Peptipedia v2.0 now incorporates over 100 000 sequences with reported biological activities, an increase of over 40% from its initial release [15]. This positions Peptipedia v2.0 not only as the largest repository of peptide sequences but also as the most extensive collection of peptides with therapeutic and antimicrobial activities, exceeding traditional databases such as EROP-Moscow [51] and SATPdb [52]. Peptipedia v2.0 surpasses traditional peptide databases in both the quantity and comprehensiveness of its records. For instance, databases such as LAMP2 [53], DBAMP [54], and SATPdb [52] collectively document over 20 000 antimicrobial peptides. In contrast, Peptipedia v2.0 alone registers more than 40 000 antimicrobial peptides. Similarly, specialized repositories for antiviral peptides, such as AVPiden [55], Antiviral Peptides Database [56], and Dravp [57], each catalogue fewer than 4000 antiviral peptides, whereas Peptipedia v2.0 records over 5500 such peptides. The same trend is observed with antifungal peptides: databases such as Ampfun [58] and LAMP2 [53] report fewer than 7000 antifungal peptides, covering only about 65% of the antifungal peptides collected in Peptipedia v2.0. This comparison highlights Peptipedia v2.0 as a more extensive resource for researchers in the peptide field.

Peptipedia incorporates an autonomous updating pipeline

The significant increase in records poses a challenge in maintaining updates, integrating new sequences, and ensuring data integrity. Extract, Transform, and Load (ETL) systems have been designed and implemented to address these database maintenance challenges.

Figure 3 provides an overview of the implemented ETL process for processing a data source. The data source is initially used to extract pertinent information, including peptide sequences, function-associated keywords, literature references, patents, and pharmacological and physicochemical properties. Additionally, metadata is collected and saved for upload into Peptipedia v2.0.

Figure 3. ETL system implemented for processing a data source and loading the collected information into Peptipedia DB. The ETL system implemented in Peptipedia facilitates processing a new data source to extract its information and load it into Peptipedia DB. First, the information is extracted from the data source, including the peptide sequence, references, keywords, and relevant properties described in the data source. Then, two transformation processes are applied to the collected raw data to facilitate the characterization of the peptide sequences and the identification of the functional biological activity. An enrichment analysis is also run to obtain Gene Ontology terms and functional domains. Finally, the load process involves inserting or updating the records in Peptipedia DB, a statistical analysis, and a summary of the ETL execution.

After data collection and extraction of relevant information, a redundancy evaluation is conducted to remove duplicate elements. The data are then processed to determine biological activity using the functional biological activity tree implemented in Peptipedia v2.0. Peptides are classified as either canonical or noncanonical based on their sequences. For peptides categorized as canonical, physicochemical and thermodynamic properties are estimated using the ModLAMP library [36].
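The redundancy check and the canonical/noncanonical split described above can be sketched as follows. This is a minimal illustration, not the Peptipedia implementation: it assumes that "redundant" means an exact (case-insensitive) sequence match and that "canonical" means a sequence built only from the 20 standard amino acids. The function names are hypothetical.

```python
# Assumption: "canonical" = only the 20 standard one-letter residue codes.
CANONICAL_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_canonical(sequence: str) -> bool:
    """Return True when every residue is one of the 20 standard amino acids."""
    return len(sequence) > 0 and set(sequence.upper()) <= CANONICAL_RESIDUES


def remove_redundancy(sequences):
    """Drop exact duplicates (case-insensitive), keeping first-seen order."""
    seen = set()
    unique = []
    for seq in sequences:
        key = seq.strip().upper()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def split_by_canonicity(sequences):
    """Deduplicate, then separate canonical from noncanonical sequences."""
    unique = remove_redundancy(sequences)
    canonical = [s for s in unique if is_canonical(s)]
    noncanonical = [s for s in unique if not is_canonical(s)]
    return canonical, noncanonical


canonical, noncanonical = split_by_canonicity(["ACDK", "acdk", "AXBK", " GIGK "])
```

Only the canonical list would then be passed on to property estimation, mirroring the order of the steps in the text.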

Next, an enrichment analysis, including Gene Ontology prediction and functional domain analysis through the Pfam tool, is applied to all canonical peptides.

Once the peptide sequences are processed, the loading process begins and the peptide sequences are inserted or updated in Peptipedia v2.0. Subsequently, the metadata, statistics, and summary records in Peptipedia v2.0 are updated, completing the process (see Section S7 of Supplementary data for a database description and a visualization of implemented materialized views in Peptipedia v2.0).

An extraction process was developed for each data source to update Peptipedia v2.0. This implementation is critical due to the unique rules and strategies governing the storage and deployment of information within each data source.

Working with each data source individually simplifies the update process, facilitating individualized execution of record updates within Peptipedia v2.0. This flexibility enables tailored configurations for updates based on the update periods stipulated by the data sources. For instance, databases such as UniProt or PDB undergo monthly updates, whereas databases like SATPdb receive annual updates. Data sources that require regular updates are included in the Peptipedia v2.0 update process. Finally, the ETL system can be executed automatically for each employed data source or manually in the case of new databases incorporated into Peptipedia v2.0.
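A per-source update schedule of this kind could be expressed as a small configuration plus a check for which sources are due. The sketch below is illustrative only: the interval values in days and the function name `sources_due` are assumptions, and only the monthly versus annual cadence comes from the text.

```python
from datetime import date, timedelta

# Hypothetical refresh intervals in days; the text only states
# "monthly" for UniProt/PDB and "annual" for SATPdb.
UPDATE_INTERVAL_DAYS = {"UniProt": 30, "PDB": 30, "SATPdb": 365}


def sources_due(last_run: dict, today: date) -> list:
    """Return the data sources whose refresh interval has elapsed."""
    due = []
    for source, interval in UPDATE_INTERVAL_DAYS.items():
        previous = last_run.get(source)
        # A source that has never been processed is always due.
        if previous is None or today - previous >= timedelta(days=interval):
            due.append(source)
    return sorted(due)


last_run = {
    "UniProt": date(2024, 1, 1),
    "PDB": date(2024, 1, 20),
    "SATPdb": date(2023, 6, 1),
}
due_today = sources_due(last_run, date(2024, 2, 5))
```

Keeping the schedule as data rather than code means a newly incorporated database only needs one new entry before its extraction step is run.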

Improving functionalities and implementing new bioinformatics tools

Peptipedia v2.0 introduces different tools to facilitate the study of peptide sequences. Firstly, enrichment analysis methods have been integrated, leveraging the predictive capabilities of Gene Ontology terms [31] and protein functional domains through the Pfam tool [38]. Secondly, the prediction of secondary structures and essential structural properties, such as solvent accessibility, is simplified by integrating the RaptorX-Property tool [32].

Furthermore, new tools and methods have been implemented to facilitate the application of machine learning algorithms to study peptide sequences. Different numerical representation strategies have been incorporated to simplify the preprocessing of peptide sequences. These strategies encompass encoding amino acids based on their physicochemical properties and using learned representations derived from pretrained models [47, 59].
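As an illustration of a physicochemical encoding, the sketch below maps each residue to its Kyte-Doolittle hydropathy value and zero-pads to a fixed length. The choice of this particular scale and the padding scheme are assumptions made for the example; the text does not state which properties Peptipedia v2.0 uses.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(sequence: str, max_length: int) -> list:
    """Encode a canonical peptide as a fixed-length hydropathy vector.

    Sequences longer than max_length are truncated; shorter ones are
    right-padded with 0.0. Noncanonical residues raise KeyError.
    """
    values = [KYTE_DOOLITTLE[res] for res in sequence.upper()[:max_length]]
    return values + [0.0] * (max_length - len(values))


vector = encode_sequence("AKL", 5)
```

The fixed-length vector can be fed directly to classical supervised learning algorithms, which is the role the embeddings from pretrained models play in the other representation strategy.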

The updated machine learning tool trains predictive models using diverse supervised learning algorithms, including Gaussian Process, ensemble methods, support vector machines, and nearest neighbour algorithms. In addition, the metrics used to evaluate training performance have been refined and now include precision, recall, and F1-score for classification tasks and root mean squared error for regression tasks. The trained models, the processed dataset, and a summary of the training process can be downloaded to facilitate using the generated model in a local environment.
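For reference, the metrics named above follow their standard definitions. The sketch below computes them with plain Python for binary labels encoded as 0/1; the function names are illustrative and not taken from the platform.

```python
import math


def classification_report(y_true, y_pred):
    """Precision, recall, and F1-score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    """Root mean squared error for regression targets."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


report = classification_report([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```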

Moreover, Peptipedia v2.0 offers pattern recognition capabilities through clustering strategies based on unsupervised learning algorithms, sequence similarity networks, and community detection methods. Peptipedia v2.0 facilitates evaluating the generated groups by applying performance metrics like the Calinski–Harabasz index and the Silhouette coefficients. These algorithms generate cohesive groups characterized using statistical analyses and enrichment tools.
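To make the cluster-quality evaluation concrete, the following sketch computes the mean silhouette coefficient from scratch. It assumes Euclidean distance over numerically encoded sequences and treats singleton clusters as scoring 0; the platform's actual implementation and distance measure are not described in the text.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster: coefficient defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated groups, while negative values indicate that points sit closer to another group than to their own, as in the second labelling.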

Finally, Peptipedia v2.0 has updated bioinformatic tools for sequence characterization. These include physicochemical property analyses, sequence alignments against the Peptipedia database, and the integration of multiple sequence alignments and statistical description methodologies for groups of protein sequences, which have been implemented to support the study of peptide sequences.

New classification models and predicting nonannotated peptide sequences

The enhancement of the biological activity tree and the expansion of peptide sequences necessitated updating the binary classification models initially developed for Peptipedia. Following the sequence-based approach detailed in the Methods section, this work used 98 biological activities to train binary classification models.

We evaluated 90 combinations of numerical representation strategies and supervised learning algorithms for each biological activity, resulting in over 8800 trained models. On average, the specificity is 0.807 ± 0.084 and the sensitivity is 0.813 ± 0.082 for the evaluated models (see Section S4 of Supplementary data for more details). Algorithms like ExtraTrees and Random Forest demonstrated the highest sensitivity and specificity, with values exceeding 0.83 ± 0.06. In contrast, Decision Tree-based models showed the lowest performance, with sensitivity and specificity values below 0.74 ± 0.08. Regarding numerical representation strategies, models trained using the pretrained models ProTrans t5 xlu50 and ProTrans t5 Uniref exhibited the highest performance, with sensitivity and specificity values above 0.82 ± 0.08. Conversely, the ProTrans t5 XLNET model showed the lowest performance, with sensitivity and specificity values below 0.78 ± 0.08 (see Section S4 of Supplementary data for more details).

The most effective combinations of numerical representation strategies and supervised learning algorithms involved ProTrans t5 Uniref or ProTrans t5 xlu50 with ExtraTrees, achieving average sensitivity and specificity values exceeding 0.85 ± 0.06. ProTrans t5 XLNET with Decision Tree was the least effective combination, with average sensitivity and specificity values below 0.72 ± 0.09.

Based on the sensitivity and specificity performance analysis and the criteria described in the Methodology section, 98 binary classification models were selected from the evaluated combinations. The chosen models achieved average sensitivities and specificities of 0.877 ± 0.053 and 0.873 ± 0.054, respectively. Combinations such as ProTrans t5 Uniref with ExtraTrees, ProTrans t5 Uniref with Gaussian Process, and ProTrans t5 xlu50 with ExtraTrees were the most frequently selected, representing 25% of all evaluated activities. In contrast, combinations like Esm1B with Adaboost, Esm1B with Random Forest, and ProTrans t5 XLNET with XGBoost were the least frequently selected, representing only 0.9% of the total explored activities (see Section S4 of Supplementary data for more details).
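The selection of one model per activity can be pictured with the sketch below. It is only an approximation of the procedure: the actual criteria are those given in the Methodology section, whereas here the ranking score (the mean of sensitivity and specificity) is an assumption, and the example rows are invented values for illustration, not results from the paper.

```python
def select_best_models(results):
    """Pick, per activity, the encoder/algorithm pair with the best score.

    `results` is an iterable of dicts with the keys: activity, encoder,
    algorithm, sensitivity, specificity. The score used here (mean of
    sensitivity and specificity) is an illustrative assumption.
    """
    best = {}
    for row in results:
        score = (row["sensitivity"] + row["specificity"]) / 2
        current = best.get(row["activity"])
        if current is None or score > current[0]:
            best[row["activity"]] = (score, row["encoder"], row["algorithm"])
    return {act: (enc, alg) for act, (_, enc, alg) in best.items()}


# Invented example rows (not values reported in the paper).
example = [
    {"activity": "Toxic", "encoder": "enc_a", "algorithm": "alg_1",
     "sensitivity": 0.80, "specificity": 0.82},
    {"activity": "Toxic", "encoder": "enc_b", "algorithm": "alg_2",
     "sensitivity": 0.88, "specificity": 0.86},
    {"activity": "Taste", "encoder": "enc_a", "algorithm": "alg_2",
     "sensitivity": 0.75, "specificity": 0.79},
]
selected = select_best_models(example)

# 90 combinations for each of 98 activities gives the "over 8800" models.
total_trained = 90 * 98
```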

A ranking of the best models (see Table 2) and the models with the lowest performance (see Table 3) based on the Matthews correlation coefficient (MCC) metric was generated. The top models were associated with various target virus activities, including anti-human parainfluenza virus and anti-Andes virus, different virus families like Bunyaviridae and Hantaviridae, and activities such as DNA-binding, neurotoxin, and transit. In contrast, the models with the lowest MCC values were related to activities such as Cosmetic and Dermatology, Blood-Brain Barrier Penetration, and Anticancer. However, despite being classified as the lowest-performing models, their MCC values were higher than 0.5, and they achieved precision averages over 75%, demonstrating high generalization and robustness of the implemented strategies.
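Since the ranking relies on the MCC, and the earlier comparisons rely on sensitivity and specificity, the sketch below states their standard definitions in terms of confusion-matrix counts. The counts in the example are invented for illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is reported as 0 when a marginal sum is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def sensitivity(tp, fn):
    """True positive rate: fraction of positives correctly identified."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tn, fp):
    """True negative rate: fraction of negatives correctly identified."""
    return tn / (tn + fp) if tn + fp else 0.0


# Invented counts for a balanced test set of 100 peptides.
example_mcc = mcc(tp=40, tn=45, fp=5, fn=10)
example_sens = sensitivity(tp=40, fn=10)
example_spec = specificity(tn=45, fp=5)
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes differ in size.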

Table 2.

Best models based on MCC performances for binary classification of functional biological activities.

| Activity | Algorithm | Encoder | MCC | Accuracy | F1-score | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Anti-human parainfluenza virus | Random Forest | Prottrans t5bdf | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| Anti-Bunyaviridae | XGB | Prottrans t5bdf | 0.974 | 0.986 | 0.986 | 0.987 | 0.986 |
| Anti-Sin Nombre virus | Random Forest | Prottrans t5bdf | 0.974 | 0.986 | 0.986 | 0.987 | 0.986 |
| DNA-binding | Bagging | Prottrans t5 uniref | 0.972 | 0.986 | 0.986 | 0.986 | 0.986 |
| Gluten immunogenic and celiac toxic | XGB | Prottrans xlnet | 0.970 | 0.985 | 0.985 | 0.985 | 0.985 |
| Anti-Hantaviridae | AdaBoost | Prottrans xlnet | 0.962 | 0.980 | 0.980 | 0.981 | 0.980 |
| Anti-Andes virus | Random Forest | Prottrans t5bdf | 0.961 | 0.980 | 0.980 | 0.981 | 0.980 |
| Neurotoxin | Gradient Boosting | Esm1b | 0.948 | 0.974 | 0.974 | 0.974 | 0.974 |
| Transit | K Neighbors | Prottrans t5 xlu50 | 0.943 | 0.971 | 0.971 | 0.972 | 0.971 |
| Cytokine | Gaussian Process | Prottrans t5 uniref | 0.910 | 0.955 | 0.955 | 0.955 | 0.955 |
Table 3.

Models with the lowest MCC performances for binary classification of functional biological activities.

| Activity | Algorithm | Encoder | MCC | Accuracy | F1-score | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Anticancer | K Neighbors | Prottrans t5 uniref | 0.607 | 0.800 | 0.799 | 0.806 | 0.800 |
| Anti-Retroviridae | Gaussian Process | Prottrans t5 xlu50 | 0.604 | 0.802 | 0.802 | 0.802 | 0.802 |
| Anti-methicillin-resistant S. aureus | Bagging | Esm1b | 0.602 | 0.806 | 0.806 | 0.806 | 0.806 |
| Cytotoxic | Gaussian Process | Prottrans t5 uniref | 0.599 | 0.798 | 0.797 | 0.802 | 0.798 |
| Blood brain barrier penetrating | Gradient Boosting | Prottrans t5 uniref | 0.591 | 0.796 | 0.796 | 0.796 | 0.796 |
| Anti-angiogenic | Extra Trees | Prottrans t5 uniref | 0.579 | 0.789 | 0.788 | 0.790 | 0.789 |
| Anti-biofilm | Extra Trees | Esm1b | 0.578 | 0.788 | 0.788 | 0.789 | 0.788 |
| Anti-influenza virus | Extra Trees | Prottrans xlnet | 0.576 | 0.790 | 0.789 | 0.790 | 0.790 |
| Cosmetic and dermatology | Random Forest | Prottrans t5bdf | 0.576 | 0.785 | 0.784 | 0.790 | 0.785 |
| Anti-Candida | Extra Trees | Prottrans t5 xlu50 | 0.505 | 0.753 | 0.753 | 0.754 | 0.753 |

Finally, the trained classification models evaluated 3.8 million sequences without reported biological activity incorporated into Peptipedia v2.0. Table 4 summarizes the classification results for the biological activities at the first level of the updated biological activity tree. Except for peptides with Molecular Binding activity, all biological activities saw an increase in records by more than 100%. Notably, categories such as therapeutics, signal peptides, propeptides, immunological, and neurological activities increased by more than 600 000 peptides. This suggests a tendency for immunological activity and propeptides among previously uncharacterized peptides. Additionally, more than 800 000 peptides showed potential therapeutic activity, indicating that they could serve as alternatives to traditional drug-based compounds. However, further evaluation of these peptides is necessary, including toxicity and immunogenicity assessments and structural analyses to understand their properties compared to previously validated therapeutic peptides [7].

Table 4.

Summary of the unknown peptides classified using the trained classification models.

| # | Activity (first level in tree) | Labelled peptides | Predicted peptides |
| --- | --- | --- | --- |
| 1 | Therapeutic | 57 750 | 837 036 |
| 2 | Cell–cell communication | 334 | 36 126 |
| 3 | Drug delivery vehicle | 2585 | 17 984 |
| 4 | Taste | 192 | 97 405 |
| 5 | Signal peptide | 11 835 | 851 175 |
| 6 | Propeptide | 1425 | 1 215 114 |
| 7 | Toxic | 21 906 | 112 584 |
| 8 | Cosmetic and dermatology | 333 | 74 956 |
| 9 | Molecular binding | 314 | 196 |
| 10 | Immunological | 4528 | 1 001 216 |
| 11 | Neurological | 10 886 | 715 027 |
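The statements about Table 4 can be checked with a few lines of arithmetic. The sketch below uses the labelled and predicted counts from Table 4 and assumes that "increase" means the predicted records expressed as a percentage of the labelled records; that reading is an interpretation, not something the text defines.

```python
# (labelled, predicted) counts per first-level activity, from Table 4.
TABLE_4 = {
    "Therapeutic": (57_750, 837_036),
    "Cell-cell communication": (334, 36_126),
    "Drug delivery vehicle": (2_585, 17_984),
    "Taste": (192, 97_405),
    "Signal peptide": (11_835, 851_175),
    "Propeptide": (1_425, 1_215_114),
    "Toxic": (21_906, 112_584),
    "Cosmetic and dermatology": (333, 74_956),
    "Molecular binding": (314, 196),
    "Immunological": (4_528, 1_001_216),
    "Neurological": (10_886, 715_027),
}


def relative_increase(labelled, predicted):
    """Predicted records as a percentage of the labelled records."""
    return 100.0 * predicted / labelled


# Activities whose predicted records do not exceed 100% of the labelled ones.
below_100 = [name for name, (lab, pred) in TABLE_4.items()
             if relative_increase(lab, pred) <= 100.0]

# Activities that gained more than 600 000 predicted peptides.
above_600k = [name for name, (_, pred) in TABLE_4.items() if pred > 600_000]
```

Under this reading, Molecular binding is the only category below the 100% mark (196 predicted against 314 labelled, about 62%), and the five categories above 600 000 predicted peptides match those named in the text.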

A new and modern face for the Peptipedia web platform

In this new version of Peptipedia, the web platform was visually restructured, and a new front-end was developed using React-based web technologies. The new Peptipedia platform separates the tools from the database visualization and its data searching and downloading options, which include FASTA and comma-separated values (CSV) files. Separating the tools from the database optimizes response times for each user-generated action (see Figure 4 for a schematic representation and Section S5 of Supplementary data for more details).

Figure 4. Homepage for Peptipedia DB and PeptiTools, the two main services implemented in Peptipedia. (a) Homepage for accessing Peptipedia DB. This homepage has (i) the full menu with all functionalities available on the system, (ii) a schematic representation of the implemented workflow to process collected data from data sources, (iii) a statistical description of the registered information, and (iv) relevant information on the project. (b) Homepage for accessing PeptiTools. This homepage includes a schematic representation of the implemented tools and a full menu to access the desired tool.

Upon accessing the Peptipedia web platform via its access link, the user will first see the Peptipedia homepage (see Figure 4a). This homepage features information about the database, a statistical summary of the information recorded in Peptipedia, a diagram illustrating how data are collected and processed, and information describing the workgroup and the project (see Section S5 of Supplementary data for more details).

The Peptipedia menu provides the main entry points to the platform, including activity analysis, data sources, direct download systems, and the search engine. The latest version of the sequence search engine lets users apply different filters to customize which information is retrieved. Queries have also been optimized through the generation of materialized views. The visualization of the results has been updated to present the characteristics of the identified sequences in a more user-friendly manner, incorporating all existing information in the platform (see Section S5 of Supplementary data for more details).

For the computational tools implemented in Peptipedia, a new platform called PeptiTools has been developed (see Figure 4b). This platform gathers all the tools included in the system: bioinformatics analysis and enrichment tools, sequence characterization, prediction of biological activities through binary classification models trained on the biological activities updated in this new version of Peptipedia, machine learning methods for building predictive models, and pattern identification methods based on peptide sequence clustering techniques and unsupervised learning algorithms.

Among the new tools incorporated in this latest version of Peptipedia, machine learning methods play an essential role in predicting and studying unknown peptide sequences and in training predictive models or recognizing patterns.

Only amino acid sequences are required when using classification models of biological activities, and the system will automatically predict the selected biological activities. To do this, the system collects the sequences, numerically represents them according to the pretrained model corresponding to the activity of interest, applies the model, and delivers the responses. A configurable probability threshold parameter is incorporated to customize predictions and provide greater decision control to the user.
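The effect of the configurable probability threshold can be illustrated as follows. This is a minimal sketch: the default of 0.5, the function name, and the example probabilities are assumptions, and the probabilities stand in for the output of whichever classification model is applied.

```python
def apply_threshold(probabilities: dict, threshold: float = 0.5) -> dict:
    """Turn per-sequence positive-class probabilities into yes/no calls.

    A sequence is reported as having the activity when its predicted
    probability is greater than or equal to the threshold.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0 and 1")
    return {seq: prob >= threshold for seq, prob in probabilities.items()}


# Invented probabilities for three query sequences.
scores = {"GIGKFLHSAK": 0.91, "ACDEFGHIK": 0.55, "KKLLKKLLKK": 0.32}
default_calls = apply_threshold(scores)
strict_calls = apply_threshold(scores, threshold=0.8)
```

Raising the threshold trades sensitivity for specificity: fewer sequences are flagged, but those that remain carry higher model confidence.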

The training of predictive models can be carried out using the corresponding tool implemented in PeptiTools (see Section S5 of Supplementary data for more details). These models are based on numerical representations of sequences: they do not employ structural techniques or feature engineering and allow only pretrained models and encodings based on physicochemical properties. In addition, different configurations for preprocessing and training models are available, including (i) selecting the algorithm to train the model, (ii) validation strategies, and (iii) standardization and dimensionality reduction. Once the model is trained, the results are displayed on the platform. The displayed results vary depending on the type of response used to train the model, for example scatter plots for regression models and sensitivity–specificity analysis with confusion matrices for classification models.
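As one example of a validation strategy, the sketch below generates train/test index splits for k-fold cross-validation. Using k-fold here is an assumption for illustration; the text does not list which validation strategies the tool offers.

```python
def k_fold_indices(n_samples: int, k: int):
    """Return k (train_indices, test_indices) pairs over range(n_samples).

    Folds are contiguous and differ in size by at most one sample. Shuffle
    the data beforehand if the samples are ordered by class.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(n_samples=5, k=2)
```

Each sample appears in exactly one test fold, so averaging a metric over the folds uses every sequence for evaluation once.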

When applying clustering strategies for pattern recognition, the tool requires input sequences, the selection of a numerical representation strategy, and the choice of methodology. The tool processes the entered configuration, applies the approach, and generates the results, displaying the characterized groups and a summary of the process (see Section S5 of Supplementary data for more details).

How to use Peptipedia v2.0? Examples and advice to improve the user experience

Peptipedia v2.0 stands out as a comprehensive resource for the search and analysis of peptide sequences with reported biological activities in the literature. Beyond serving as a powerful search site, this platform facilitates the description of peptides from phylogenetic, structural, descriptive, and functional perspectives, making it an integrated system for both searching and characterizing sequences.

Firstly, Peptipedia v2.0 optimizes the search for peptide sequences by integrating various data sources and features, allowing users to apply different filters, explore results, and download them in an accessible format. The ability to combine filters and narrow the search range enables users to refine their queries based on specific descriptors, such as sequence length or isoelectric point. Peptipedia v2.0 also facilitates searches by filtering for canonical residues and biological activities, both reported and predicted. It is worth noting that including sequences with predicted biological activities increases the number of results (see Figure S6 in Section S5 of the Supplementary data for more details). However, it is important to consider the potential margin of error associated with predictive models.

Secondly, Peptipedia v2.0 allows for a detailed study of peptide sequences through the application of various bioinformatics and predictive tools. For example, the platform facilitates the phylogenetic evaluation of sequences, as well as the prediction of Gene Ontology terms and Pfam domains. Additionally, it offers tools for predicting structural properties and estimating physicochemical properties, enabling a complete characterization of sequences of interest. By utilizing predictive models, Peptipedia v2.0 also allows for the in silico identification of peptides with potential biological activities, making it useful in the characterization of unknown peptides obtained from sequencing processes, to predict both biological activities and physicochemical properties.

A third application is related to the use of the machine learning tools available in Peptipedia v2.0. These tools allow the application of clustering methods to identify patterns and the use of predictive modelling strategies to train classification systems or predict properties of interest to the user. An example would be training an antiviral peptide classification model specific to viruses of the Retroviridae family. The user can select the type of numerical representation, the algorithm to apply, and the validation strategy. Peptipedia v2.0 handles the model training, generating a performance report and providing a section to facilitate the use of the model in predicting new sequences.

By bringing together various tools in a single web platform, Peptipedia v2.0 not only allows for the search and identification of peptides reported in the literature but also provides simplified access to powerful bioinformatics tools without the need for complex installations on a personal computer. In this way, the challenges associated with the installation and execution of tools, as well as user requirements, software components, and operating system compatibility, are eliminated, as everything is integrated into an accessible and efficient web platform for the study of peptides and their biological activities.

Conclusions

This work introduces Peptipedia v2.0, a significant update to our previously reported peptide database. Peptipedia v2.0 integrates data from over 70 sources, compiling over 100 000 peptides with reported biological activity and over 3.8 million peptide sequences without reported biological activity. The functional biological activity tree has been updated, increasing the number of biological activities and refining the hierarchical structure to facilitate the efficient study of peptide functions.

In this new version of Peptipedia, all registered peptides are characterized using enrichment analysis approaches that integrate Gene Ontology term predictions, secondary structure evaluation, and functional domain analysis. Additionally, various physicochemical and thermodynamic properties are estimated for each peptide sequence, enhancing the knowledge base of Peptipedia v2.0. The database also includes patents, references, and pharmacological properties related to the peptide sequences.

The classification models have been enhanced with the updated functional biological activity tree. A combination of numerical representation strategies based on embedding approaches through pretrained models and classical supervised learning algorithms was used to train the classification models, achieving an average sensitivity and specificity of 87% across more than 95 evaluated models. These models were utilized to identify potential therapeutic peptides and predict the functional biological activity of the 3.8 million previously uncharacterized peptide sequences.
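For readers less familiar with these metrics, the sketch below shows how sensitivity and specificity are obtained from the predictions of one binary classifier and how per-model values can be summarized as a mean and standard deviation. It is an illustrative calculation with hypothetical function names. Whether the reported spread is a sample or a population standard deviation is not stated here, so the sample form is an assumption.

```python
from statistics import mean, stdev


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0/1.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    Assumes both classes occur in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def summarize(per_model_scores):
    """Mean and sample standard deviation of one metric across models."""
    return mean(per_model_scores), stdev(per_model_scores)
```

Applying `sensitivity_specificity` to each trained model and passing the resulting lists to `summarize` yields averages in the form mean ± standard deviation.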

Furthermore, Peptipedia v2.0 incorporates new services and tools, including enrichment analysis, statistical evaluation, and machine learning–based strategies, to facilitate the study of peptide sequences. With the updated database and these tools, Peptipedia v2.0 positions itself as one of the most comprehensive public repositories for peptide research.

Acknowledgement

D.M.-O. acknowledges ANID for the project “SUBVENCIÓN A INSTALACIÓN EN LA ACADEMIA CONVOCATORIA AÑO 2022”, Folio 85220004. D.M.-O., R.U.-P., and A.D. gratefully acknowledge support from the Centre for Biotechnology and Bioengineering (PIA project FB0001, Conicyt, Chile), and M.A.N. acknowledges ANID for grant Anillo ATE220016. M.A.N. and R.U.-P. acknowledge ANID for grant Fondecyt 1230298. M.D.D. acknowledges funding by the Deutsche Forschungsgemeinschaft (German Research Foundation), SPP2363. PEACCEL was supported through a research program partially cofunded by the European Union (UE) and Region Reunion (FEDER).

Author contributions

Gabriel Cabas-Mora and David Medina-Ortiz (Conceptualization), David Medina-Ortiz, Diego Alvarez, and Gabriel Cabas-Mora (Methodology), David Medina-Ortiz, Anamaría Daza, and Marcelo Navarrete (Validation), Gabriel Cabas-Mora, Lindybeth Sarmiento-Varón, Nicole Soto-García, Valentina Garrido, and Anamaría Daza (Investigation), David Medina-Ortiz, Anamaría Daza, Julieta H. Sepúlveda Yañez, Gabriel Cabas-Mora, Marcelo Navarrete, Roberto Uribe-Paredes, Alvaro Olivera-Nappa, Frederic Cadet, and Mehdi D. Davari (Writing – review and editing), David Medina-Ortiz and Roberto Uribe-Paredes (Supervision, funding resources), and David Medina-Ortiz and Roberto Uribe-Paredes (Project administration).

Supplementary data

Supplementary data is available at Database online.

Conflict of interest

The authors declare that the research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

Data availability

Source code and datasets are available on GitHub: (i) Peptipedia v2.0 database: https://github.com/ProteinEngineering-PESB2/Peptipedia and (ii) Peptipedia v2.0 tools: https://github.com/ProteinEngineering-PESB2/Peptitools. The ETL process scripts are located at https://github.com/ProteinEngineering-PESB2/PeptipediaParser. The user-friendly web platform is publicly accessible at https://app.peptipedia.cl/ for non-commercial use, under a Creative Commons CC BY-NC-ND 4.0 license. The Peptipedia v2.0 database is available for non-commercial use under an ODbL license. The SQL dump file, the peptide sequences with reported biological activities, and the peptide sequences with biological activities predicted by the trained models are available at https://drive.google.com/drive/folders/1IDNhWmROMfdpgj6ADunBgVb0YBBBaJui?usp=drive_link.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
