Abstract

Natural products play a pivotal role in drug discovery. Their richness, although significantly influenced by environmental factors, is predominantly determined by the genetics of the enzymatic reactions that produce them as secondary metabolites. To date, few natural product-related databases have treated chemical content as a prominent property. To gain unique insights into the quantitative diversity of natural products, we have developed the first TerPenoids database embedded with Content information (TPCN), with features such as compound browsing, structural search, scaffold analysis, similarity analysis and data download. The database can be accessed through a web-based computational toolkit available at http://www.tpcn.pro/. Through meticulous manual searches and analysis of over 10 000 reference papers, the TPCN database integrates 6383 terpenoids obtained from 1254 distinct plant species. It includes detailed information such as isolation parts, complete molecular structures, Chemical Abstracts Service registry numbers (CAS numbers) and 7508 content descriptions. The TPCN database emphasizes both the qualitative and quantitative dimensions as valuable phenotypic characteristics of natural products that have undergone genetic evolution. By serving as a selection criterion, the TPCN database facilitates the discovery of drug alternatives with high content and the selection of high-yield medicinal plant species or phylogenetic alternatives, thereby fostering sustainable, cost-effective and environmentally friendly drug discovery in pharmaceutical farming.

Database URL: http://www.tpcn.pro/

Introduction

Natural products are a vital source of innovative drugs (1–3). Terpenoids are the most abundant class of natural products, ranging from hemiterpenes and monoterpenes with very low molecular weight (MW) to triterpenes and tetraterpenes with large MWs, and from linear, planar molecules to complex three-dimensional bridged and ring structures (4, 5). Compared with other natural products such as flavonoids and phenylpropanoids, terpenes are more diverse in structure, quantity and activity (5, 6). Owing to this enormous structural and physicochemical diversity, terpenoids play an important role in drug discovery and the pharmaceutical field.

Nearly 76% of terpenoids are derived from plants (7), and they are one of the main groups of bioactive compounds in medicinal plants (6). Potent and active terpenoids such as andrographolide (used against bacillary dysentery) (8), paclitaxel (anticancer) (9), artemisinin (antimalarial) (10), triptolide (anti-inflammatory) (11), asiaticoside (vulnerary) (12) and paeoniflorin (anti-inflammatory) (13) are derived from plants. Presently, ∼70 000 different plants are used by traditional and modern medical systems worldwide (14). In China, around 329 species of medicinal herbs are cultivated on >5.56 million hectares (15). According to the World Health Organization, the current global market for medicinal plants is worth $14 billion per annum and is projected to exceed $5 trillion by 2050 (16). These estimates highlight the substantial growth expected in the demand for medicinal plants over the coming decades.

The content of natural products is affected by various factors. Recent research has found that the content of the same natural product can vary with molecular regulation, species, environmental conditions and combinations of these factors (17–19). The evolution of natural products also shows characteristics of convergent evolution (20). Unrelated plants may evolve the same natural products, or different compounds with functionally similar properties, through genetic changes driven by environmental stress (21–23). Taking advantage of the considerable differences in the abundance of structurally similar compounds, a low-content natural product can be substituted with another compound that shares similar functions but can be obtained in higher yield from the original plant (24).

To keep pace with contemporary drug discovery and meet the growing demand for pharmaceutical raw materials, experts in natural product science must consistently enhance both the quality and quantity of active compounds (25). The composition and concentration of natural products facilitate the selection of active compounds of high content, and they determine the quality of herbal medicines, which helps in selecting high-yield germplasm resources (26). Several terpenoid databases have been reported with chemical structures, biological sources, bioactivity and terpene synthases, but few specifically describe variation in terpenoid content (7, 27, 28). In this study, we summarize the species, content and tissue origin of terpenes isolated from plants between 1946 and 2022, including many active compounds, and establish a content-embedded database of terpenoids (TPCN), which is accessible through a web-based computational toolkit available at http://www.tpcn.pro/. The TPCN includes the yield of secondary metabolites, the key target phenotypic trait of medicinal plants, as an important reference to facilitate the discovery of drug alternatives with enhanced content for higher druggability and to assist in screening high-quality medicinal plant lines or identifying new alternative lines.

Materials and methods

Data sources

All data in the TPCN were extracted from the literature and various online database resources (Figure 1). The Web of Science was searched using keyword combinations like ‘terpenoids’, ‘monoterpenes’, ‘diterpenes’, ‘sesquiterpenes’ and ‘triterpenes’ to collect literature on the content of terpenoids from 1946 to 2022. Then, the content information of terpenoids was recorded and input into the database manually, including chemical names, biological sources (family and species), extraction parts, content values and literature sources. The structure of terpenoids was extracted from SciFinder and standardized using RDKit (29). To ensure the accuracy of the structures, we compared the structures from literature and SciFinder, recording solely the matching structural information.

Alt text: MySQL serves as the repository for the data. The website is built with the Python-based Django framework and provides functions such as browsing, searching, analyzing and downloading compounds.
Figure 1.

The architecture of TPCN.

Data distribution

Relying on manual collection and sorting of literature data, the distribution of terpenoids from different perspectives was analyzed, including structural type, biological source, extraction part and content. These terpenoids consisted of four categories: monoterpenoids, sesquiterpenoids, diterpenoids and triterpenoids, with the respective counts calculated for each category. A more comprehensive analysis of the distribution of terpenoids from various biological sources (family and species) and extraction parts was also conducted. In addition, to conduct a thorough analysis of the content distribution of terpenoids, we segmented the content into seven ranges: 10⁻⁶% to 10⁻⁵%, 10⁻⁵% to 10⁻⁴%, 10⁻⁴% to 10⁻³%, 10⁻³% to 10⁻²%, 10⁻²% to 10⁻¹%, 10⁻¹% to 1% and 1% to 10%. The number of terpenoids in each content range was counted. It is noteworthy that the classification information of terpenoids was initially extracted from the literature and then incorporated into the database. When extracting 1 g of terpenoids from 1 kg of raw material, the content is expressed as 0.1% (i.e. 10⁻¹%).
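The content calculation and the assignment to the seven ranges can be sketched as follows. This is a minimal illustration in plain Python, not the code behind TPCN: the function names and the treatment of each range as half-open (lower edge included, upper edge excluded) are assumptions made here.

```python
from bisect import bisect_right
from collections import Counter

# Edges of the seven content ranges, in percent (10^-6 % up to 10 %).
EDGES = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]


def content_percent(isolated_g, raw_material_g):
    """Content as a mass percentage of the raw plant material."""
    return isolated_g * 100.0 / raw_material_g


def content_range(percent):
    """Return the half-open range (low, high) with low <= percent < high."""
    if not EDGES[0] <= percent < EDGES[-1]:
        raise ValueError(f"{percent}% lies outside the seven ranges")
    i = bisect_right(EDGES, percent) - 1
    return EDGES[i], EDGES[i + 1]


def count_by_range(percents):
    """Count how many content values fall into each range."""
    return Counter(content_range(p) for p in percents)


# 1 g of terpenoid isolated from 1 kg (1000 g) of raw material.
example = content_percent(1.0, 1000.0)
print(example, content_range(example))
```

Comparing against a fixed list of edges avoids rounding problems that a logarithm-based bin index could show exactly at the range boundaries.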

Similarity calculation

A similarity measure comprises three essential components: molecular representation, weighting scheme and similarity coefficient (30). The Tanimoto coefficient is widely used in chemoinformatics and computational medicinal chemistry owing to its computational simplicity and speed. However, it depends to some extent on molecular size, giving reduced similarity values particularly when searching with small reference structures (where only a few bits are set in the reference structure’s fingerprint) (31). The Dice coefficient is also widely used to measure molecular similarity because it is simple to calculate, although it is computationally slower than the Tanimoto coefficient (32). The Cosine coefficient is frequently used to gauge similarity between sparse data and can rapidly calculate the average similarity between all pairs of compounds in a dataset (33). In the similarity search interface of TPCN, four molecular fingerprints (Daylight fingerprint, ECFP4, ECFP6 and MACCS) and three similarity indices (Tanimoto/Jaccard coefficient, Dice coefficient and Cosine coefficient) can be selected to calculate the similarity between molecules using RDKit. Unless otherwise specified, the Daylight fingerprint and the Tanimoto coefficient were used to calculate the similarity between terpenoids. In addition, the content differences between terpenoids with a structural similarity over 0.95 were calculated. For terpenoids with multiple content values, the content difference was calculated from the maximum content value of each terpenoid.
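The three coefficients can be illustrated with a short sketch. TPCN computes them with RDKit on real fingerprints; here, as a simplifying assumption, a fingerprint is represented as a plain Python set of on-bit indices, and the function names are illustrative only.

```python
import math


def tanimoto(a, b):
    """Shared on-bits divided by the on-bits set in either fingerprint."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def dice(a, b):
    """Twice the shared on-bits divided by the total on-bits of both."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Shared on-bits divided by the geometric mean of the on-bit counts."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


# Toy fingerprints: a small reference fully contained in a larger molecule.
small_ref = {1, 2}
large_mol = set(range(1, 21))

for name, coefficient in (("Tanimoto", tanimoto), ("Dice", dice), ("Cosine", cosine)):
    print(name, round(coefficient(small_ref, large_mol), 3))
```

The toy pair shows the size dependence mentioned above: every bit of the small reference is present in the larger fingerprint, yet the Tanimoto value is only 2/20 because the union is dominated by the larger molecule.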

Murcko scaffold analysis

The Murcko scaffold is the core structure of a molecule, composed of its ring systems and the linkers between them. Double bonds directly attached to the ring systems or linkers are retained (Supplementary Figure S1) (34). To further explore the relationship between terpenoid scaffolds and their content, the dominant Murcko scaffolds (ordered by occurrence frequency) of terpenoids from different content ranges were generated with RDKit. First, we categorized terpenoids into seven groups based on their content ranges. Next, we generated the Murcko scaffolds for each group of terpenoids and recorded the occurrence frequency of each scaffold. Lastly, the top 10 dominant Murcko scaffolds for each content range were displayed. In addition, to explore the relationship between the content of terpenoids and their glycosylation levels, the glycosylation ratio of terpenoids from different content ranges was counted. Sugar Removal Utility (SRU), a tool for deglycosylation, was used to identify and remove the sugar moieties of terpenoids (35). The parameters of the SRU were set as follows. Circular and linear sugar moieties, as well as non-terminal and terminal sugar moieties, were all removed. Fragments with fewer than five heavy atoms that became disconnected from the molecule after the removal of sugar moieties were removed. The minimum ratio between the exocyclic oxygen atoms of a circular sugar and the atoms within the sugar ring was set to 0.4. Other parameters were left at their default values (36).
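The grouping and ranking step described above can be sketched in plain Python. The sketch assumes the Murcko scaffolds have already been generated (in TPCN this is done with RDKit) and are available as SMILES strings; the records, range labels and function name below are illustrative placeholders, not data from the database.

```python
from collections import Counter, defaultdict


def top_scaffolds(records, n=10):
    """Rank scaffolds by occurrence frequency within each content range.

    records: iterable of (content_range_label, scaffold_smiles) pairs.
    Returns {content_range_label: [(scaffold_smiles, count), ...]}.
    """
    counts = defaultdict(Counter)
    for content_range, scaffold in records:
        counts[content_range][scaffold] += 1
    return {rng: counter.most_common(n) for rng, counter in counts.items()}


# Hypothetical toy records: simple ring systems standing in for scaffolds.
records = [
    ("[1e-4, 1e-3)", "C1CCCCC1"),
    ("[1e-4, 1e-3)", "C1CCCCC1"),
    ("[1e-4, 1e-3)", "C1CCC2CCCCC2C1"),
    ("[1e-2, 1e-1)", "C1CCOCC1"),
]

for content_range, ranked in top_scaffolds(records).items():
    print(content_range, ranked)
```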

Physicochemical property calculation

To explore the differences in the physicochemical properties of terpenoids from different content ranges, we calculated 11 physicochemical properties of terpenoids by RDKit. These physicochemical properties are MW, hydrogen bond acceptor (HBA), hydrogen bond donor (HBD), octanol–water partition coefficient (AlogP), topological polar surface area (TPSA), number of rotatable bonds (NumRotatableBonds) (conjugated single bonds were not considered), number of heavy atoms (NumHeavyAtoms), number of aromatic rings (NumAromaticRings), number of aliphatic rings (NumAliphaticRings), number of rings (NumRings) and fraction of Csp3 atoms (FractionCsp3). The average value of the physicochemical properties of terpenoids from different content ranges was calculated.

Results and discussion

Analysis of the content of terpenoids with species and tissue sources

TPCN is the first content-embedded database of terpenoids, comprising 7508 content records for 6383 unique terpenoids. Based on the structural characteristics of terpenoids, the terpenoids in the TPCN database were classified into four categories: monoterpenoids, sesquiterpenoids, diterpenoids and triterpenoids. Triterpenoids (2856, 44.74%) are the predominant category of terpenoids in the TPCN database, followed by diterpenoids (1351, 21.17%), sesquiterpenoids (1346, 21.09%) and monoterpenoids (830, 13.00%) (Figure 2A).

Alt text: The statistics of terpenoids in the TPCN database based on (A) structural type, (B) family, (C) extraction part and (D) content range.
Figure 2.

Distribution of terpenoids in the TPCN database based on (A) structural type, (B) family, (C) extraction part and (D) content range.

Plants are an extraordinary source of bioactive molecules, and ∼76% of terpenoids are derived from plants (7). In the TPCN database, the terpenoids were derived from 1254 species belonging to 156 different plant families (Figure 3). Asteraceae (979, 15.34%) was the main source of terpenoids, followed by Ranunculaceae (620, 9.71%), Fabaceae (382, 5.98%) and Lamiaceae (342, 5.36%) (Figure 2B). Consistent with previous studies, terpenoids are among the most important secondary metabolites identified from Asteraceae (37).

Alt text: The bar chart on the outside corresponds to the number of compounds isolated from each species.
Figure 3.

Schematic of the distribution of terpenoids across plant phylogeny.

Many terpenoids show specificity to certain species and tissues, indicating species-specific and tissue-specific functions (38–41). We divided the extraction parts of these terpenoids into 17 types. According to the statistical results, the number of terpenoids from the root (1534, 24.03%) was the largest, followed by aerial parts (1306, 20.46%), whole plants (1100, 17.23%) and leaves (870, 13.63%). In contrast, terpenoids derived from seedlings (3, 0.05%) and knots (3, 0.05%) were rarely found (Figure 2C). In general, the root is the predominant accumulation organ of terpenoids.

The production of natural products is important for functional research and commercial development (42). However, for the vast majority of natural products, low content in plant tissues and the long growth cycle of plants are the most important factors limiting further research and development (43). In the TPCN database, the content of terpenoids ranged from 0.000001% to 3.744898%. We divided these terpenoids into seven content ranges. Most terpenoids fell within the content range of 10⁻⁴% to 10⁻³% (Figure 2D), that is, roughly 1–10 parts per million.

Subsequently, we conducted further investigations into the number and content distribution of terpenoids in different extraction parts of plants. The results indicated that monoterpenoids were mainly derived from the aerial parts, while sesquiterpenoids exhibited significant distribution in both roots and aerial parts. Diterpenoids primarily originated from whole plants, and triterpenoids were prominently distributed in the roots of plants (Figure 4A). Regarding the content of different extraction parts of plants, the content of the majority of terpenoids was between 10⁻⁵% and 10⁻²%. Particularly, in the case of terpenoids derived from roots and aerial parts of plants, the content of terpenoids in these parts was mostly between 10⁻⁴% and 10⁻³% (Figure 4B).

Alt text: (A) The plant tissues distribution of various terpenoids. (B) The content distribution of terpenoids in each part. The numbers represent the amount of terpenoids.
Figure 4.

The plant tissues and content distribution of terpenoids. (A) The plant tissues distribution of various terpenoids. (B) The content distribution of terpenoids in each part.

The structural similarity calculation and application of higher-content compounds

Most high-value natural products have low natural abundance and require tedious chemical synthesis, which hinders their clinical translation. Structural similarity is one of the key strategies for drug discovery (44). Higher-content compounds that are structurally similar to high-value compounds can potentially be modified for wider applications (24).

To analyze the structural similarity of terpenoids, we used RDKit to generate the Daylight fingerprint of each terpenoid and quantified the similarity between them using the Tanimoto coefficient. Of the examined pairs of terpenoids, 50 978 had a structural similarity of >95% and 6512 pairs had a structural similarity exceeding 99% (Figure 5A).
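An all-pairs screen of this kind can be sketched as follows. The fingerprints are again represented as sets of on-bit indices, which is an assumption made for illustration (TPCN uses RDKit fingerprints), and the compound identifiers and function names are hypothetical.

```python
from itertools import combinations


def n_pairs(n_compounds):
    """Number of unordered compound pairs among n_compounds."""
    return n_compounds * (n_compounds - 1) // 2


def similar_pairs(fingerprints, threshold=0.95):
    """Return (id_a, id_b, similarity) for pairs with Tanimoto > threshold.

    fingerprints: dict mapping a compound id to a set of on-bit indices.
    """
    hits = []
    for (id_a, fp_a), (id_b, fp_b) in combinations(fingerprints.items(), 2):
        shared = len(fp_a & fp_b)
        union = len(fp_a) + len(fp_b) - shared
        similarity = shared / union if union else 0.0
        if similarity > threshold:
            hits.append((id_a, id_b, similarity))
    return hits


# Toy data: A and B differ by a single bit; C shares only half of A's bits.
toy = {
    "A": set(range(40)),
    "B": set(range(39)),
    "C": set(range(20)),
}
print(similar_pairs(toy))
```

The number of comparisons grows quadratically: if each of the 6383 unique terpenoids were compared with every other, `n_pairs(6383)` gives 20 368 153 pairs, so the pairs above 0.95 would be a small fraction of all comparisons.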

Alt text: (A) The distribution of terpenoid compound pairs with structural similarity exceeding 0.95. (B) Content ratio between terpenoids with structural similarity exceeding 0.95. (C) The host source, extraction parts and content distribution of terpenoids with structural similarity exceeding 0.95 compared with paeoniflorin.
Figure 5.

The distribution and content variation of structurally similar terpenoids. (A) The distribution of terpenoid compound pairs with structural similarity exceeding 0.95. (B) Content ratio between terpenoids with structural similarity exceeding 0.95. (C) The host source, extraction parts and content distribution of terpenoids with structural similarity exceeding 0.95 compared with paeoniflorin.

Additionally, we measured the content variation among structurally similar terpenoids. For the majority of these pairs, the content differed by less than 10-fold. However, eight pairs of terpenoids with a structural similarity exceeding 95% showed content differences of more than 100 000-fold (Figure 5B). For example, although glycyrrhizin and uralsaponin R share a structural similarity of 0.9728, their content differed by a factor of 166 667 (Table 1).

Table 1.

The variation of terpenoid content with structural similarity exceeding 0.95

| CAS number^a | Species^a | Part^a | Content (%)^a | CAS number^b | Species^b | Part^b | Content (%)^b | Similarity | Fold^ab |
|---|---|---|---|---|---|---|---|---|---|
| 1405-86-3 | Glycyrrhiza inflata | Roots | 1.500000 | 1616062-84-0 | Glycyrrhiza uralensis | Roots | 0.000009 | 0.9728 | 166 667 |
| 57817-89-7 | Stevia rebaudiana | Leaves | 2.522321 | 64849-39-4 | Orychophragmus violaceus | Seeds | 0.000018 | 0.9674 | 140 129 |
| 1405-86-3 | Glycyrrhiza inflata | Roots | 1.500000 | 1616062-82-8 | Glycyrrhiza uralensis | Roots | 0.000011 | 0.9728 | 136 364 |
| 1405-86-3 | Glycyrrhiza inflata | Roots | 1.500000 | 1616062-79-3 | Glycyrrhiza uralensis | Roots | 0.000011 | 0.9828 | 136 364 |
| 1422049-84-0 | Acanthophyllum gypsophiloides | Roots | 3.744898 | 1704595-03-8 | Momordica charantia | Seeds | 0.000033 | 0.9680 | 113 482 |
| 121340-61-2 | Bellis sylvestris | Whole plants | 2.600000 | 1028100-30-2 | Gypsophila oldhamiana | Roots | 0.000023 | 0.9544 | 113 043 |
| 1405-86-3 | Glycyrrhiza inflata | Roots | 1.500000 | 1616062-88-4 | Glycyrrhiza uralensis | Roots | 0.000014 | 0.9790 | 107 143 |
| 1405-86-3 | Glycyrrhiza inflata | Roots | 1.500000 | 1616062-83-9 | Glycyrrhiza uralensis | Roots | 0.000014 | 0.9728 | 107 143 |

Superscript a and b represent two different terpenoids (compound a and compound b), respectively. Foldab represents the content ratio of compound a and compound b. This table shows the compound pairs with a structural similarity of 0.95 or higher and the content ratio >10 000 in the TPCN.
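The fold values in Table 1 can be checked with a short calculation. The sketch below follows the rule stated in the methods that, for a terpenoid with several content records, the maximum content value is used; the function name is illustrative, and the content values are taken from Table 1.

```python
def fold_ab(contents_a, contents_b):
    """Content ratio of compound a to compound b.

    Each argument is a list of content values (%) recorded for one
    compound; the maximum value of each compound enters the ratio.
    """
    return max(contents_a) / max(contents_b)


# (content of compound a, content of compound b) for four Table 1 rows.
rows = [
    (1.500000, 0.000009),
    (2.522321, 0.000018),
    (3.744898, 0.000033),
    (2.600000, 0.000023),
]
for content_a, content_b in rows:
    print(round(fold_ab([content_a], [content_b])))
```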


Notably, we observed that even closely related species can produce similar chemicals with significant differences in content. For example, Glycyrrhiza inflata contains a high content of glycyrrhizin, whereas the genetically proximate species G. uralensis contains five structurally similar counterparts at extremely low abundance (Table 1). Conversely, Bellis sylvestris (family: Asteraceae, order: Asterales, class: Magnoliopsida) and Gypsophila oldhamiana (family: Caryophyllaceae, order: Caryophyllales, class: Magnoliopsida), despite their distant taxonomic relationship, both produce very similar chemicals (Table 1). Using such information, we were able to identify alternatives to celastrol that exhibit improved drug-like properties, along with increased availability and reduced toxicity (24).

Paeoniflorin is mainly extracted from Paeoniaceae plants (45, 46). Modern medical studies have shown that paeoniflorin has immunoregulatory, antidepressant, anti-arthritis, antithrombosis, anti-tumor, hepatoprotective, cerebral ischemic injury protective and neuroprotective effects (13, 47). However, paeoniflorin has a low yield and is difficult to separate from extracts. Moreover, the biosynthesis pathway of paeoniflorin has not been fully elucidated (45), which fundamentally limits its production by synthetic biology. Compounds with a similarity between 0.95 and 1 to paeoniflorin were distributed among five species of the genus Paeonia, and the higher-content compounds could be modified to obtain functions similar to those of paeoniflorin at higher abundance (Figure 5C).

Similar to the complexity of crop yields (48), the content of a natural product is essentially the result of a cascade of enzyme-catalyzed reactions, which are genetically encoded. However, environmental factors, such as temperature, light and soil composition, as well as human actions such as the timing of harvest, processing and storage, also exert a significant impact on the content. We have retained all the original content information in our database. For instance, across four independent investigations, the content of paeoniflorin in the roots of Paeonia lactiflora from different producing areas varied by more than 300-fold. The content of paeoniflorin in two materials from China was similar but significantly lower than that measured in the other two materials from Vietnam, which were also similar to each other (Supplementary Figure S2). In another example, oleanolic acid (OA), a triterpenoid, exists in numerous plant species, with content differences of ∼2-fold within the same organs in Eriope blanchetii (Supplementary Figure S3). Furthermore, the major pharmaceutical components showed greater accumulation in Artemisia annua after graphene treatment, suggesting that the graphene-based cultivation strategy offers a novel solution to the problem of low artemisinin content, and that graphene could serve as a nanofertilizer to replace chemical fertilizer and decrease non-point-source pollution derived from agriculture. It is a promising strategy for the environmentally friendly cultivation of medicinal plants. Thus, the high level of compounds may inspire us to select more efficient and environmentally sustainable cultivation methods (49). Such significant variations may pique the interest of researchers working on these plants, prompting further investigations into the underlying biotic or abiotic factors contributing to the differences.

Scaffold and physicochemical properties of higher-content compounds

To further explore the relationship between the content and structure of terpenoids, the Murcko scaffolds of terpenoids in different content ranges were generated using RDKit. The top five Murcko scaffolds with the highest frequency in each content range are shown (Figure 6). The results showed that 1,2,3,4,4a,5,6,6a,6b,7,8,8a,9,10,11,12,12a,12b,13,14b-icosahydropicene was the most frequently occurring scaffold among terpenoids across the content ranges. This reflects the fact that triterpenoids make up the majority of terpenoids in the database. Additionally, terpenoids with higher content tended to contain more oxygenated furan rings and oxygenated pyran rings. These structures are the core of many sugars and sugar-like units. To further investigate the content and glycosylation ratio of terpenoids, the glycosylation ratio of terpenoids from different content ranges was calculated. The findings revealed a positive correlation between the content of terpenoids and their glycosylation levels (Table 2). This could be attributed to the fact that glycosylation can enhance the water solubility and stability of terpenoids, thereby facilitating their storage.

Alt text: The numbers in brackets indicate the content range of terpenoids. The numbers below the compound structure represent the count of terpenoids with that scaffold.
Figure 6.

The dominant Murcko scaffold of terpenoids in different content ranges. The numbers represent the count of terpenoids with the scaffold.

Table 2.

The glycosylation ratio of terpenoids in different content ranges

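| Content (%) | All | Glycosides | Percentage (%) |
|---|---|---|---|
| [10⁻⁶, 10⁻⁵) | 145 | 56 | 38.62 |
| [10⁻⁵, 10⁻⁴) | 1339 | 534 | 39.88 |
| [10⁻⁴, 10⁻³) | 3080 | 1397 | 45.36 |
| [10⁻³, 10⁻²) | 1833 | 981 | 53.52 |
| [10⁻², 10⁻¹) | 439 | 272 | 61.96 |
| [10⁻¹, 1) | 75 | 62 | 82.67 |
| [1, 10) | 8 | 5 | 62.50 |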
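The percentages in Table 2 follow directly from the two count columns, as the sketch below shows. It only reproduces the arithmetic: the identification of glycosides with the SRU is not reproduced here, the counts are copied from Table 2, and the names are illustrative.

```python
# (content range in %, all terpenoids, glycosides), copied from Table 2.
TABLE_2 = [
    ("[1e-6, 1e-5)", 145, 56),
    ("[1e-5, 1e-4)", 1339, 534),
    ("[1e-4, 1e-3)", 3080, 1397),
    ("[1e-3, 1e-2)", 1833, 981),
    ("[1e-2, 1e-1)", 439, 272),
    ("[1e-1, 1)", 75, 62),
    ("[1, 10)", 8, 5),
]


def glycoside_percentage(n_all, n_glycosides):
    """Share of glycosylated terpenoids in one content range, in percent."""
    return round(100.0 * n_glycosides / n_all, 2)


for content_range, n_all, n_glycosides in TABLE_2:
    print(content_range, glycoside_percentage(n_all, n_glycosides))
```

The highest range contains only eight terpenoids, so its percentage is far more sensitive to single compounds than the values for the well-populated ranges.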

We also analyzed the relationship between the content and physicochemical properties of terpenoids. The 11 physicochemical properties of terpenoids were determined using RDKit. The results demonstrate that the properties related to molecular size and complexity, including HBA (Figure 7A), HBD (Figure 7B), TPSA, MW, NumRotatableBonds and NumHeavyAtoms (Figure 7D–G), FractionCsp3 (Figure 7K), RingCount (Figure 7I) and NumAliphaticRings (Figure 7J), showed a positive correlation with the content of terpenoids. In contrast, NumAromaticRings (Figure 7H) and AlogP (Figure 7C) correlated negatively with the content of terpenoids. This indicates that the larger, more complex and more hydrophilic a compound is, the higher its content may be. This trend may be attributed to the introduction of sugar units through glycosylation of terpenoids.

Alt text: (A) Hydrogen bond acceptor (HBA), (B) hydrogen bond donor (HBD), (C) octanol–water partition coefficient (AlogP), (D) topological polar surface area (TPSA), (E) molecular weight (MW), (F) number of rotatable bonds (NumRotatableBonds), (G) number of heavy atoms (NumHeavyAtoms), (H) number of aromatic rings (NumAromaticRings), (I) number of rings (RingCount), (J) number of aliphatic rings (NumAliphaticRings) and (K) fraction of Csp3 atoms (FractionCsp3).
Figure 7.

Physicochemical properties of terpenoids in different content ranges. (A) The hydrogen bond acceptors, HBA. (B) The hydrogen bond donors, HBD. (C) The octanol–water partition coefficient, AlogP. (D) The topological polar surface area, TPSA. (E) The molecular weight, MW. (F) The number of rotatable bonds, NumRotatableBonds. (G) The number of heavy atoms, NumHeavyAtoms. (H) The number of aromatic rings, NumAromaticRings. (I) The ring count, RingCount. (J) The number of aliphatic rings, NumAliphaticRings. (K) The fraction of sp3 hybridized carbons, FractionCSP3.

Examples of TPCN with ginsenosides

Ginsenosides are specialized triterpene saponins uniquely present in Panax species (48). Among the species of the genus Panax (50), P. ginseng (51), P. notoginseng (52), P. quinquefolius (53) and P. japonicus (54) have been widely used as medicines and functional foods. At present, most isolated ginsenosides can be divided into the dammarane type (DM type), OA type and ocotillol type (OCT type) according to the structural differences of their glycosides. According to differences in the hydroxyl substituent at the C6 position, DM-type ginsenosides are divided into protopanaxadiol-type (PPD-type) ginsenosides and protopanaxatriol-type (PPT-type) ginsenosides (55). Tetracyclic triterpene DM saponins account for the majority of ginsenosides. Among the saponins isolated from ginseng, PPD-type ginsenosides have the most types and the highest content, followed by the PPT type, while OA-type ginsenosides have the fewest types and the lowest content. The higher-content ginsenosides Rb1, Rb2, Rc, Rd, Re and Rg1 (the main ginsenosides, accounting for >80% of the total ginseng saponins) contain more saccharide groups and are more hydrophilic, but their biological activity is low, and their absorption rate in the human body is also very low (Figure 8). Rare ginsenosides (Rh2, Rg3, etc.) contain fewer glycosyl groups, have better biological activity and a higher absorption rate, and play a significant role in regulating metabolism, promoting cell differentiation and resisting tumors (56, 57); however, their content in natural ginseng plants is very low. The types and contents of ginsenosides differ among species of ginseng plants, and the content of ginsenosides also varies widely within the same ginseng plants. The main active ingredients of P. ginseng, P. quinquefolius, P. notoginseng and other medicinal materials are DM-type saponins, while the main active ingredients of P. japonicus are OA-type ginsenosides, with a small amount of DM-type ginsenosides (Figure 8).

Alt text: Protopanaxadiol-type (PPD-type), protopanaxatriol-type (PPT-type), oleanolic acid type (OA-type) and ocotillol type (OCT-type).
Figure 8.

The host source, extraction parts and content distribution of ginsenosides.

Thus, different genotypes of ginseng plants influence the ginsenoside type and content. Although P. quinquefolius, P. ginseng and P. notoginseng are morphologically and phylogenetically close, each ginseng species contains characteristic types and/or levels of ginsenosides. These differences among various ginseng species reflect the genetic diversity in synthesis and accumulation of ginsenosides in different ginseng species.
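
The classification rules described in this section can be written down as a small lookup. The following is a minimal Python sketch and not part of TPCN itself: the function name `classify_ginsenoside`, the skeleton labels and the `c6_oxygenated` flag are hypothetical names introduced here purely for illustration.

```python
from enum import Enum


class GinsenosideType(Enum):
    """Structural classes named in the text."""
    PPD = "protopanaxadiol-type (DM type)"
    PPT = "protopanaxatriol-type (DM type)"
    OA = "oleanolic acid type"
    OCT = "ocotillol type"


def classify_ginsenoside(skeleton: str, c6_oxygenated: bool = False) -> GinsenosideType:
    """Assign a ginsenoside class from its aglycone skeleton.

    skeleton      -- hypothetical label: "DM", "OA" or "OCT"
    c6_oxygenated -- only consulted for DM-type skeletons; True when the
                     C6 position carries a hydroxyl (free or glycosylated)
    """
    label = skeleton.strip().upper()
    if label == "DM":
        # DM-type saponins split into PPT and PPD at the C6 position.
        return GinsenosideType.PPT if c6_oxygenated else GinsenosideType.PPD
    if label == "OA":
        return GinsenosideType.OA
    if label == "OCT":
        return GinsenosideType.OCT
    raise ValueError(f"unknown skeleton label: {skeleton!r}")
```

Under these assumptions, `classify_ginsenoside("DM")` returns the PPD type and `classify_ginsenoside("DM", c6_oxygenated=True)` returns the PPT type, while an unrecognized label raises a `ValueError` instead of being silently assigned to a class.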

Web interface of TPCN

To facilitate the use of this database and to allow its data to be expanded continually, we have hosted it at http://www.tpcn.pro/. TPCN comprises home, browse, search, analysis, download and help document interfaces. The home interface provides an overview of the database, including its introduction, data composition and data sources. It also enables users to browse a specific category of terpenoids by clicking on the corresponding module (Figure 9A). The browse interface offers table browse and card browse views (Figure 9B). Users can view the detailed information of a compound, including its structure, content and physicochemical properties, by clicking on its molecular image (Figure 9C). The search interface allows users to retrieve relevant compounds using various search criteria, including basic information, physicochemical properties, Murcko scaffold and the structure of terpenoids. The basic information search covers the name, SMILES, chemical abstracts service registry number (CAS number), molecular formula, biological source (family and species), extraction part and classification of terpenoids. The physicochemical properties search and the Murcko scaffold search let users narrow down target compounds by specific physicochemical properties or Murcko scaffolds. The structure search offers three modes: exact search, substructure search and similarity search. A plugin (58) for chemical structure drawing is integrated into the web page and can be used for structural searching (Figure 9D). The analysis interface displays all the Murcko scaffolds of terpenoids in the database (Figure 9E), as well as the similarity and content variations of terpenoids with structural similarity exceeding 0.95 (Figure 9F). The download interface allows users to download the structures of the terpenoids as well as their species sources. The detailed functionality and usage of the database are provided in the help document interface.
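
The property-range search, the similarity search and the 0.95 similarity cut-off described above can be illustrated with a few lines of Python. This is a simplified sketch and not the TPCN implementation: it assumes a Tanimoto coefficient computed on fingerprint bit sets (a common choice for fingerprint similarity), and the `Compound` record, its field names, the function names and the toy library are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class Compound:
    """Hypothetical record; field names are illustrative, not TPCN's schema."""
    name: str
    fingerprint: FrozenSet[int]  # indices of the "on" bits of a fingerprint
    properties: Dict[str, float] = field(default_factory=dict)


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0  # convention: empty vs empty -> 0


def filter_by_properties(
    compounds: List[Compound], ranges: Dict[str, Tuple[float, float]]
) -> List[Compound]:
    """Keep compounds whose properties lie inside every (low, high) range."""
    return [
        c for c in compounds
        if all(
            key in c.properties and low <= c.properties[key] <= high
            for key, (low, high) in ranges.items()
        )
    ]


def similarity_search(
    query: FrozenSet[int], compounds: List[Compound], threshold: float
) -> List[Tuple[str, float]]:
    """Return (name, score) hits with score >= threshold, best first."""
    hits = [(c.name, tanimoto(query, c.fingerprint)) for c in compounds]
    return sorted((h for h in hits if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)


def similar_pairs(
    compounds: List[Compound], threshold: float = 0.95
) -> List[Tuple[str, str, float]]:
    """List compound pairs whose similarity exceeds the threshold."""
    pairs = []
    for a, b in combinations(compounds, 2):
        score = tanimoto(a.fingerprint, b.fingerprint)
        if score > threshold:
            pairs.append((a.name, b.name, score))
    return pairs


# Toy library with made-up fingerprints and property values (not TPCN data).
library = [
    Compound("cpd_A", frozenset(range(40)), {"MW": 300.0}),
    Compound("cpd_B", frozenset(range(39)), {"MW": 460.0}),
    Compound("cpd_C", frozenset(range(20, 60)), {"MW": 310.0}),
]
```

With this toy library, `filter_by_properties(library, {"MW": (250.0, 350.0)})` keeps `cpd_A` and `cpd_C`, and `similar_pairs(library)` reports only the `cpd_A`/`cpd_B` pair, whose fingerprints differ by a single bit. In a real setting the bit sets would come from a cheminformatics toolkit rather than being written by hand.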

Alt text: (A) Home, (B) browse, (C) compound detail, (D) structural search, (E) scaffold analysis and (F) similarity analysis.
Figure 9.

The web pages of TPCN database. (A) Home; (B) Browse; (C) Compound detail; (D) Structural search; (E) Scaffold analysis; (F) Similarity analysis.

Conclusion

Terpenoid natural products exhibit intricate molecular structures and possess immense potential for pharmacological applications, making them a highly valuable resource for drug discovery. Despite the extensively documented efficacy of plants as sources of terpenoids, the sustainable and economically feasible production of most of these compounds in significant quantities remains a formidable challenge, particularly in cases where extraction from plants is necessary. Consequently, attaining high yields of natural products becomes a pivotal factor in augmenting agricultural productivity and fostering environmental sustainability. In response to this challenge, the comprehensive platform TPCN has been devised. TPCN presently serves as the most extensive repository of comprehensive data on molecule structures, biological sources and extraction methods, offering significant assistance to researchers in the meticulous selection of suitable species for breeding, extraction of phytochemicals and identification of alternative candidates for drug discovery derived from natural products. The development of TPCN shall also advance our understanding of the fundamental biosynthetic mechanisms underlying natural products and, more specifically, their chemical diversity, encompassing both qualitative and quantitative aspects as invaluable phenotypic characteristics that have progressively evolved over time. For economically significant natural product–based drugs and their alternatives, the heightened chemical content found in plants represents a heritable trait stemming from the efficiency of photosynthesis and secondary metabolic transformations. This attribute confers substantial benefits in terms of eco-friendliness, cost-effectiveness and practicality within the realm of pharmaceutical agriculture.

Supplementary Material

Supplementary Material is available at Database online.

Data availability

All the data used in this manuscript are collected from the literature and various online database resources and can be freely accessed or downloaded by visiting our website (http://www.tpcn.pro/). The source code used in this analysis is available from https://github.com/ylchen0622/TPCN.

Funding

National Science & Technology Fundamental Resources Investigation Program of China (No. 2018FY100704 to X.L.); the National Natural Science Foundation of China (No. 21977033 to D.X.K.); Key R&D Program of Hubei Province for International Cooperation (No. 2022EHB047 to X.H.); Shennongjia Academy of Forestry, Hubei, China (No. SAF202102 to X.H.); Hubei Technology Innovation Center for Agricultural Sciences—‘2020 key technology research and demonstration project of safe and efficient production of genuine medicinal materials’ (No. 2020–620-000-002-04 to X.H.); Hubei Provincial Administration of Traditional Chinese Medicine (No. ZY2023Z005 to S.W.); and Yunnan Science and Technology Program (No. 202205AF150004 to X.X.).

Conflict of interest

None declared.

References

1. Atanasov A.G., Zotchev S.B., Dirsch V.M. et al. (2021) Natural products in drug discovery: advances and opportunities. Nat. Rev.: Drug Discov., 20, 200–216.

2. Huang B. and Zhang Y. (2022) Teaching an old dog new tricks: drug discovery by repositioning natural products and their derivatives. Drug Discov. Today, 27, 1936–1944.

3. Dong S., Guo X., Han F. et al. (2022) Emerging role of natural products in cancer immunotherapy. Acta Pharm. Sin. B, 12, 1163–1185.

4. Hill R.A. and Connolly J.D. (2013) Triterpenoids. Nat. Prod. Rep., 30, 1028–1065.

5. Pichersky E. and Raguso R.A. (2018) Why do plants produce so many terpenoid compounds? New Phytol., 220, 692–702.

6. Tetali S.D. (2019) Terpenes and isoprenoids: a wealth of compounds for global use. Planta, 249, 1–8.

7. Zeng T., Liu Z., Zhuang J. et al. (2020) TeroKit: a database-driven web server for terpenome research. J. Chem. Inf. Model., 60, 2082–2090.

8. Dai Y., Chen S.R., Chai L. et al. (2019) Overview of pharmacological activities of Andrographis paniculata and its major compound andrographolide. Crit. Rev. Food Sci. Nutr., 59, 17–29.

9. Powell M.A., Filiaci V.L., Hensley M.L. et al. (2022) Randomized phase III trial of paclitaxel and carboplatin versus paclitaxel and ifosfamide in patients with carcinosarcoma of the uterus or ovary: an NRG oncology Trial. J. Clin. Oncol., 40, 968–977.

10. Ma N., Zhang Z., Liao F. et al. (2020) The birth of artemisinin. Pharmacol. Ther., 216, 107658.

11. Li X.J., Jiang Z.Z. and Zhang L.Y. (2014) Triptolide: progress on research in pharmacodynamics and toxicology. J. Ethnopharmacol., 155, 67–79.

12. Liu L., Ding Z., Yang Y. et al. (2021) Asiaticoside-laden silk nanofiber hydrogels to regulate inflammation and angiogenesis for scarless skin regeneration. Biomater. Sci., 9, 5227–5236.

13. Zhang L. and Wei W. (2020) Anti-inflammatory and immunoregulatory effects of paeoniflorin and total glucosides of paeony. Pharmacol. Ther., 207, 107452.

14. Brinckmann J.A., Kathe W., Berkhoudt K. et al. (2022) A new global estimation of medicinal and aromatic plant species in commercial cultivation and their conservation status. Econ. Bot., 76, 319–333.

15. Wang H., Zhang X.B., Wang J. et al. (2022) Statistical analysis of planting area of Chinese medicinal materials in China in 2020. China Food Drug Admin., 20, 4–9.

16. Gusain P., Uniyal D. and Joga R. (2021) Conservation and sustainable use of medicinal plants. In: Egbuna C., Mishra A.P. and Goyal M.R. (eds.), Preparation of Phytopharmaceuticals for the Management of Disorders. Elsevier, Amsterdam, The Netherlands, pp. 409–427.

17. Xu H., Zhang W., Zhou Y. et al. (2023) Systematic description of the content variation of natural products (NPs): to prompt the yield of high-value NPs and the discovery of new therapeutics. J. Chem. Inf. Model., 63, 1615–1625.

18. Yang L., Wen K.S., Ruan X. et al. (2018) Response of plant secondary metabolites to environmental factors. Molecules, 23, 1–26.

19. Mahajan M., Kuiry R. and Pal P.K. (2020) Understanding the consequence of environmental stress for accumulation of secondary metabolites in medicinal and aromatic plants. J. Appl. Res. Med. Aromat. Plants, 18, 1–10.

20. Grenade N.L., Chiriac D.S., Pasternak A.R.O. et al. (2023) Discovery of a tambjamine gene cluster in Streptomyces suggests convergent evolution in bipyrrole natural product biosynthesis. ACS Chem. Biol., 18, 223–229.

21. Torrens-Spence M.P., Chiang Y.C., Smith T. et al. (2020) Structural basis for divergent and convergent evolution of catalytic machineries in plant aromatic amino acid decarboxylase proteins. Proc. Natl. Acad. Sci. U.S.A., 117, 10806–10817.

22. Huang A.C., Kautsar S.A., Hong Y.J. et al. (2017) Unearthing a sesterterpene biosynthetic repertoire in the Brassicaceae through genome mining reveals convergent evolution. Proc. Natl. Acad. Sci. U.S.A., 114, E6005–E6014.

23. Pichersky E. and Lewinsohn E. (2011) Convergent evolution in plant specialized metabolism. Annu. Rev. Plant Biol., 62, 549–566.

24. Zhou B., Yuan Y., Shi L. et al. (2021) Creation of an anti-inflammatory, leptin-dependent anti-obesity celastrol mimic with better druggability. Front. Pharmacol., 12, 1–17.

25. David B., Wolfender J.L. and Dias D.A. (2015) The pharmaceutical industry and natural products: historical status and new trends. Phytochem. Rev., 14, 299–315.

26. Lautie E., Russo O., Ducrot P. et al. (2020) Unraveling plant natural chemical diversity for drug discovery purposes. Front. Pharmacol., 11, 397.

27. Zeng T., Chen Y., Jian Y. et al. (2022) Chemotaxonomic investigation of plant terpenoids with an established database (TeroMOL). New Phytol., 235, 662–673.

28. Chen N., Zhang R., Zeng T. et al. (2023) Developing TeroENZ and TeroMAP modules for the terpenome research platform TeroKit. Database, 2023, baad020.

29. Landrum G.A. (2013) RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling. Greg. Landrum, 8, 31.

30. Bajorath J. (eds) (2011) Chemoinformatics and Computational Chemical Biology, Vol. 672. Humana, Totowa, NJ, USA.

31. Flower D.R. (1998) On the properties of bit string-based measures of chemical similarity. J. Chem. Inf. Comput. Sci., 38, 379–386.

32. Bero S.A., Muda A.K., Choo Y.H. et al. (2017) Similarity measure for molecular structure: A brief review. J. Phys.: Conf. Ser., 892, 012015.

33. Willett P., Barnard J.M. and Downs G.M. (1998) Chemical similarity searching. J. Chem. Inf. Comput. Sci., 38, 983–996.

34. Bemis G.W. and Murcko M.A. (1996) The properties of known drugs 1. Molecular frameworks. J. Med. Chem., 39, 2887–2893.

35. Schaub J., Zielesny A., Steinbeck C. et al. (2020) Too sweet: cheminformatics for deglycosylation in natural products. J. Cheminform., 12, 1–20.

36. Chen Y., Liu Y., Chen N. et al. (2023) A chemoinformatic analysis on natural glycosides with respect to biological origin and structural class. Nat. Prod. Rep., 40, 1469–1478.

37. Jiang Y., Zhang W., Chen X. et al. (2021) Diversity and biosynthesis of volatile terpenoid secondary metabolites in the chrysanthemum genus. Crit. Rev. Plant Sci., 40, 422–445.

38. Zhang K., Wang N., Gao X. et al. (2022) Integrated metabolite profiling and transcriptome analysis reveals tissue-specific regulation of terpenoid biosynthesis in Artemisia argyi. Genomics, 114, 110388.

39. Jiang C., Fei X., Pan X. et al. (2021) Tissue-specific transcriptome and metabolome analyses reveal a gene module regulating the terpenoid biosynthesis in Curcuma wenyujin. Ind. Crop. Prod., 170, 113758.

40. Dossou S.S.K., Xu F., Cui X. et al. (2021) Comparative metabolomics analysis of different sesame (Sesamum indicum L.) tissues reveals a tissue-specific accumulation of metabolites. BMC Plant Biol., 21, 1–14.

41. Amini H., Naghavi M.R., Shen T. et al. (2019) Tissue-specific transcriptome analysis reveals candidate genes for terpenoid and phenylpropanoid metabolism in the medicinal plant Ferula assafoetida. G3-Genes Genomes Genet., 9, 807–816.

42. Sarropoulou V., Sarrou E., Angeli A. et al. (2023) Species-specific secondary metabolites from Primula veris subsp. veris obtained in vitro adventitious root cultures: an alternative for sustainable production. Sustainability, 15, 1–12.

43. Li Y., Wang J., Li L. et al. (2022) Natural products of pentacyclic triterpenoids: from discovery to heterologous biosynthesis. Nat. Prod. Rep., 40, 1303–1353.

44. Wei W., Cherukupalli S., Jing L. et al. (2020) Fsp(3): a new parameter for drug-likeness. Drug Discov. Today, 25, 1839–1845.

45. Zhang X.X., Zuo J.Q., Wang Y.T. et al. (2022) Paeoniflorin in Paeoniaceae: distribution, influencing factors, and biosynthesis. Front. Plant Sci., 13, 980854.

46. Tabata K., Matsumoto K. and Watanabe H. (2000) Paeoniflorin, a major constituent of peony root, reverses muscarinic M1-receptor antagonist-induced suppression of long-term potentiation in the rat hippocampal slice. Jap. J. Pharmacol., 83, 25–30.

47. Peng W., Chen Y., Tumilty S. et al. (2022) Paeoniflorin is a promising natural monomer for neurodegenerative diseases via modulation of Ca2+ and ROS homeostasis. Curr. Opin. Pharm., 62, 97–102.

48. Bailey-Serres J., Parker J.E., Ainsworth E.A. et al. (2019) Genetic strategies for improving crop yields. Nature, 575, 109–118.

49. Cao J., Chen Z., Wang L. et al. (2023) Graphene enhances artemisinin production in traditional medicinal plant Artemisia annua via dynamic physiological progress and miRNA regulation. Plant Commun., 5, 100742.

50. Hou M., Wang R., Zhao S. et al. (2021) Ginsenosides in Panax genus and their biosynthesis. Acta Pharm. Sin. B, 11, 1813–1834.

51. Xue Q., He N., Wang Z. et al. (2021) Functional roles and mechanisms of ginsenosides from Panax ginseng in atherosclerosis. J. Ginseng. Res., 45, 22–31.

52. Wang T., Guo R., Zhou G. et al. (2016) Traditional uses, botany, phytochemistry, pharmacology and toxicology of Panax notoginseng (Burk.) F.H. Chen: a review. J. Ethnopharmacol., 188, 234–258.

53. Mancuso C. and Santangelo R. (2017) Panax ginseng and Panax quinquefolius: from pharmacology to toxicology. Food Chem. Toxicol., 107, 362–372.

54. Wang X.J., Xie Q., Liu Y. et al. (2021) Panax japonicus and chikusetsusaponins: a review of diverse biological activities and pharmacology mechanism. Chin. Herb. Med., 13, 64–77.

55. Yang W.Z., Hu Y., Wu W.Y. et al. (2014) Saponins in the genus Panax L. (Araliaceae): a systematic review of their chemical diversity. Phytochemistry, 106, 7–24.

56. Xu W., Lyu W., Duan C. et al. (2023) Preparation and bioactivity of the rare ginsenosides Rg3 and Rh2: an updated review. Fitoterapia, 167, 105514.

57. Li M., Ma M., Wu Z. et al. (2023) Advances in the biosynthesis and metabolic engineering of rare ginsenosides. Appl. Microbiol. Biotechnol., 107, 3391–3404.

58. Bienfait B. and Ertl P. (2013) JSME: a free molecule editor in JavaScript. J. Cheminform., 5, 24.

Author notes

The first two authors contributed equally to this study.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data