EnzyMine: a comprehensive database for enzyme function annotation with enzymatic reaction chemical feature

Table 1. Detailed comparison of the content of databases

Database	Enzyme	Mechanism	Reaction	Reaction centre (chemical feature)	Reaction rule (chemical feature)	Core-to-core (chemical feature)
ExplorEnz (15)	√	—	—	—	—	—
EzCatDB (16)	√	√	√	—	—	—
BRENDA (17)	√	—	√	—	—	—
BNICE (24)	—	—	√	—	√	—
RetroRules (22)	—	—	√	—	√	—
EnzyMine	√	√	√	√	√	√

To address the deficiencies mentioned earlier, we propose EnzyMine, a comprehensive enzyme feature and annotation database. First, to capture the diversity of enzyme data and provide a comprehensive set of enzyme features, we collected the sequence- and structure-related features and the catalytic features scattered across the above databases. Next, we added systematic mining of reaction features from chemical structures. Reaction features are expanded beyond reaction compounds to include the reaction centre, reaction rule and core-to-core analysis, using our previous work, RxnBLAST (23). Core-to-core is defined as the scaffold transformation made up of the atoms that change during the reaction. In particular, core-to-core analysis can reveal the unique and complex change of structural scaffold in an enzymatic reaction, helping to show the effect of the enzyme on the molecular scaffold. This chemical feature mining aids an intuitive understanding of changes in chemical composition and of reaction patterns.

In this report, we mined 7767 EC numbers and present these enzymes visually with comprehensive sequence and structural features, covering 267 345 protein sequences across 8058 organisms, along with in-depth calculated chemical features for 9831 reactions. This study introduces the process of reaction feature mining and analysis and the molecular structure-based search methods, and makes the following contributions: (i) curation of scattered enzyme sequence, structure and catalytic data; (ii) chemical feature mining and analysis of enzymatic reactions; and (iii) efficient search methods based on chemical structure queries and text queries.

Materials and methods

Data collection and database content

Using the EC numbers in the ExPASy release (2020/2/27) (ftp://ftp.expasy.org/databases/enzyme/enzyme.dat), EnzyMine displays 7767 enzymes with complete sequences, structural data and interface links to the referenced databases. These databases provide information on enzyme function, sequence and family classification, as well as protein structure, molecular structure, known reactions and associated literature. The sources and the amount of available data are shown in Table 2.

Table 2. Current contents of EnzyMine

Related data	Description	Number
EC number	Enzyme commission number	7767
UniProt	Protein sequence data from UniProt (9)	267 345
Organism	Number of species under EC number	8058
PROSITE	Protein domains and functional sites in PROSITE (6)	1450
CATH	Protein superfamily classification (5)	2017
SCOP	Structural classification of proteins (8)	9514
PDB	Tertiary structure data from Protein Data Bank (10)	141 414
Catalytic mechanisms	Detailed catalytic mechanisms of an enzyme in M-CSA (7)	864
Compound data	Compound data for substrates/products/cofactors	27 765
Reaction data	Enzymatic reactions with EC number in Rhea (16), KEGG (17)	9831
Reference	References on enzymes and reactions	39 833

Database architecture

EnzyMine is built on Django, a Python web development framework, with data stored in a PostgreSQL relational database. The front end of the site is built with HTML, CSS, Bootstrap, jQuery and JavaScript. Fuzzy search is implemented with Haystack (http://haystacksearch.org/), an open-source full-text search framework for Django. Molecule searching is implemented with RDKit (http://www.rdkit.org/), an open-source cheminformatics and machine learning toolkit that provides C++ and Python APIs. The overall architecture is shown in Figure 1.

Figure 1.

The broad architecture of EnzyMine processing, including the front-end and back-end search engines, which return displayed data.

The algorithm in query methods

The Tanimoto coefficient performs remarkably well in similarity-based virtual screening, where chemical structures are compared, and is an appropriate choice for quantifying molecular similarity. In particular, a Tanimoto coefficient-based similarity algorithm is used to construct the similarity retrieval method (24). Maximum common substructure (MCS) searching is built on the fMCS algorithm (25), which provides a flexible and powerful alternative way to search by calculating the distance between two molecules. The MCS percentage can be expressed as T_MCS(A, B):

$$T_{MCS}(A,B) = \frac{|MCS(A,B)|_a}{|A|_a + |B|_a - |MCS(A,B)|_a} \qquad (1)$$

In this equation, |A|_a and |B|_a are the numbers of atoms in the query molecule and in a molecule from our database, respectively, and |MCS(A, B)|_a is the number of atoms in the MCS of the two molecules.
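As a worked illustration of the two scores, the following minimal sketch computes a fingerprint-based Tanimoto coefficient and an MCS-based score from atom counts. It assumes the standard Tanimoto form, in which the MCS size is subtracted in the denominator so that two identical molecules score 1.0. The function names, the bit sets and the atom counts are hypothetical and chosen only for illustration; this is not the EnzyMine or RDKit implementation.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def t_mcs(n_atoms_a, n_atoms_b, n_atoms_mcs):
    """MCS-based Tanimoto score from atom counts (subtractive denominator)."""
    if n_atoms_mcs > min(n_atoms_a, n_atoms_b):
        raise ValueError("the MCS cannot be larger than either molecule")
    denominator = n_atoms_a + n_atoms_b - n_atoms_mcs
    return n_atoms_mcs / denominator if denominator else 0.0


# Hypothetical example: a 10-atom query, a 12-atom database molecule, 8 shared atoms.
score = t_mcs(10, 12, 8)          # 8 / (10 + 12 - 8)
identical = t_mcs(10, 10, 10)     # identical molecules give 1.0
bit_score = tanimoto({1, 4, 7, 9}, {1, 4, 8})  # 2 shared bits out of 5 distinct bits
```

Because both scores are bounded by 1.0, hits can be ranked on a common scale and filtered with a user-chosen cutoff.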

Chemical feature extraction in an enzymatic reaction

The chemical feature mining strategy in EnzyMine is to extract features from enzymatic reactions systematically, with particular emphasis on visualizing features such as the reaction centre and scaffold transformation. In EnzyMine, we adopt SMARTS (SMILES arbitrary target specification), an extension of SMILES (simplified molecular-input line-entry system), as the encoding for reaction analysis (https://www.daylight.com/dayhtml/doc/theory/). Generally, atoms that change within a reaction are described as reaction centres, which are equivalent to the atoms attached to the bonds that are broken or formed (21). A reaction rule summarizes the overall change in atom-bond pattern taking place in the reaction and can differ according to the radius taken around the reaction centre (26). We used the RxnBLAST tool, which extracts scaffold transformations and reactive chemical environment features by analysing atom-atom mapping (23), to obtain the characteristics of the reaction environment, including the reaction centre, reaction rules and core-to-core. Our approach for visualizing the reaction centre and reaction rules is described in Figure 2. The generative process is divided into three parts. (1) The reaction centre is generated from atom-atom mapping. (2) Neighbours adjacent to the reaction centre are characterized using different radii; here, we choose a radius of 1 for specificity. (3) For reactions with multiple substrates or products, reaction rules are extracted by treating each reactant in turn as the main reactant molecule, which limits the combinatorial complexity.
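Steps (1) and (2) can be sketched in a few lines, assuming that atom-atom mapping has already been computed and that each side of the reaction is given as a dictionary from a bond (a pair of atom-map numbers) to its bond order. The data layout, the function names and the toy ester-hydrolysis-like example are assumptions made for illustration; this is not the RxnBLAST implementation.

```python
from collections import deque


def bond(a, b):
    """Order-independent key for a bond between two atom-map numbers."""
    return frozenset((a, b))


def reaction_centre(reactant_bonds, product_bonds):
    """Atoms attached to bonds that are broken, formed or change order."""
    changed = {
        b
        for b in reactant_bonds.keys() | product_bonds.keys()
        if reactant_bonds.get(b) != product_bonds.get(b)
    }
    return {atom for b in changed for atom in b}


def environment(centre, bonds, radius):
    """Reaction centre plus all atoms within `radius` bonds of it (breadth-first)."""
    adjacency = {}
    for b in bonds:
        a1, a2 = tuple(b)
        adjacency.setdefault(a1, set()).add(a2)
        adjacency.setdefault(a2, set()).add(a1)
    seen = set(centre)
    frontier = deque((atom, 0) for atom in centre)
    while frontier:
        atom, dist = frontier.popleft()
        if dist == radius:
            continue
        for neighbour in adjacency.get(atom, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return seen


# Toy mapped reaction: bond 1-2 is broken and bond 1-4 is formed.
reactants = {bond(1, 2): 1, bond(2, 3): 1, bond(1, 5): 2,
             bond(1, 6): 1, bond(6, 7): 1, bond(3, 8): 1}
products = {bond(1, 4): 1, bond(2, 3): 1, bond(1, 5): 2,
            bond(1, 6): 1, bond(6, 7): 1, bond(3, 8): 1}

centre = reaction_centre(reactants, products)   # atoms 1, 2 and 4
radius_1 = environment(centre, reactants, 1)    # adds atoms 3, 5 and 6
radius_2 = environment(centre, reactants, 2)    # adds atoms 7 and 8 as well
```

A larger radius yields a more specific environment that matches fewer substrates, which is the trade-off behind choosing a radius when extracting rules.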

Figure 2.

Calculation of the reaction rules for EC 1.97.1.2. (A) The reaction transformation is labelled. (B) Different radii can be used to identify adjacent neighbours in the reaction. (C, E) The reaction rules are calculated for two reactants. (D, F) Reaction rules encoded in SMARTS.

Results

Database overview

EnzyMine displays 7767 enzyme entries complemented with sequences, structural data and family classification, as well as molecular structures, known reactions and associated literature. The coverage of EC numbers by each database is shown in Figure 3A. With the rapid growth of sequence data and the improvement of sequencing methods, all EC numbers are covered by sequence data, while reaction data and family classification also reach high coverage. By contrast, precise 3D structures of many enzymes still need to be determined by further experimental studies.

Figure 3.

(A) Venn diagram of the coverage of EC numbers by four databases, a: sequence amount in UniProt, b: reaction amount in Rhea, c: family classification amount in PROSITE, d: 3D structure amount in Protein Data Bank. (B) The overall distribution of core-to-core across the seven types of enzyme.

At the reaction level, EnzyMine provides an intuitive display of the reaction centre for 9831 enzyme-catalysed reactions, along with the extracted reaction rules and core-to-core characterization. In addition, to meet users' retrosynthesis design requirements, the website provides online calculation of reaction rules at different radii. The distribution of core-to-core types across the seven types of enzyme is shown in Figure 3B. In this diagram, the type two enzymes (transferases) react with a wide range of substrates and have the largest number of core-to-core types. In addition, a small number of core-to-core types occur in reactions across different types of enzymes, most frequently between the type two transferases and the type three hydrolases.

Search results and web usage

On the homepage, EnzyMine provides various search methods based on text and molecular retrieval algorithms. When a user enters an ambiguous query for an enzyme name or organism, the Haystack engine shows all possible results. Additionally, similarity and MCS searches for molecular queries return a series of hits with a similarity score or MCS percentage of 1.0 or less. The MCS percentage is calculated using Equation (1). Compounds in the existing reactions can be searched by chemical structure or sequence similarity, and MCS processing provides metabolic results for many more molecules. On the all data page, we list the latest changes associated with each piece of data, such as the deletion of entry 5.3.99.1, the transfer of 5.3.3.15 to 5.3.2.7 and the addition of seven enzymes. Selecting an organism returns the enzymes specific to that organism.

These search methods and example results are shown in Figure 4A. On the details page, the result is divided into three parts: basic info, sequence analysis and reaction feature analysis. First, the basic info part displays enzyme information, 3D structures and related references, as shown in Figure 4B. Second, in sequence analysis, EnzyMine lists the UniProt data for all organisms and the function patterns in PROSITE under the EC number, as shown in Figure 4C. Multiple sequence alignments and phylogenetic trees support the sequence analysis and help to provide a better understanding. Finally, the calculated reaction centres, rules and core-to-core for 9831 reactions are provided in the corresponding reaction analyses, as shown in Figure 4D. For enzymes whose catalytic mechanisms have been elucidated, we collected the catalytic mechanisms from M-CSA, each connected to the enzymatic reaction features and shown as Sankey charts or force-layout graphs. Force-layout graphs emphasize enzymatic catalytic features, with the active sites registered in UniProt and the reactions under the EC number. Sankey diagrams offer more detail on the catalytic mechanism of each enzyme.

Figure 4.

Detailed EnzyMine screenshots and search results for EC:1.1.1.1. (A) Front-end search methods based on text queries and molecular search queries. (B) Basic enzyme information is listed, with the EC, PROSITE and CATH classifications shown on the left. Various display options for protein 3D structures are provided by the NGL engine (27). (C) Multiple sequence alignments and phylogenetic trees for the UniProt sequences under each EC number are provided through a link, and details of a single UniProt entry are shown by clicking the button on the right. (D) The reaction centre display, rules and core-to-core are shown. Detailed enzyme information and feature mining information are available for download.

Core-to-core analysis

The number of reactions in each of the seven types of enzyme and the analysis of core-to-core types are shown in Figure 5. The most frequent core-to-core type occurs in oxidoreductases and represents the process in which NAD(+) takes up hydrogen and is converted to NADH. Transferases have the broadest range of core-to-core types. Among these, the most numerous type corresponds to the transfer of glucuronic acid from uridine diphosphate (UDP) to other (usually hydrophobic) molecules, which is quite common in membrane proteins and changes the water solubility of the acceptor molecules, promoting their export. Notably, a reaction that has a reaction rule may not have a core-to-core type. This is relatively common in type 2 enzymes. For example, the reaction with Rhea ID 21824 belongs to the transferase with EC number 2.6.1.1. It involves the conversion of side chains on the same skeleton, so the core does not change during the reaction. All core-to-core data are available on the EnzyMine download page.

Figure 5.

Number of reactions and core-to-core types in the seven types of enzymes. The count of the most frequent core-to-core type is given in parentheses, and the corresponding core-to-core depictions are shown to the right of the bar graph. EC1: Oxidoreductases, EC2: Transferases, EC3: Hydrolases, EC4: Lyases, EC5: Isomerases, EC6: Ligases, EC7: Translocases.

Enzyme annotation integrations from sequence to reaction

An extended EC number prediction method based on the core-to-core strategy is also available in EnzyMine as a simple application of these chemical feature data. EnzyMine predicts EC numbers for enzymatic reactions using the logic that different molecules with similar surroundings and the same reactive sites will share the same chemical transformations (28). This allows users to search for and annotate unannotated reactions. To extend this annotation function to both sequences and reactions, a one-stop annotation integration is offered in the unfolded annotation tools on the EnzyMine homepage. EnzyMine uses the HMMER3 algorithm in FunFHMMer to provide Gene Ontology annotation for protein sequence queries via a web server (29, 30). Bio2Rxn, a user-friendly platform for automatic reaction annotation based on protein sequences, has been integrated into EnzyMine due to its consensus strategy and high precision in EC number prediction (31). It adopts a voting strategy to achieve high coverage and accuracy, which supports the ab initio understanding of enzymes and will expand sequence annotation.
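The idea that reactions with similar reactive environments share an EC assignment can be illustrated with a simple nearest-neighbour vote. The sketch below is not EnzyMine's prediction method: it assumes each annotated reaction is reduced to a set of feature tokens, ranks annotated reactions by set overlap with the query, and lets the top k hits vote on a truncated EC number. The feature tokens, the function names and the parameters `k` and `level` are hypothetical.

```python
from collections import Counter


def overlap(a, b):
    """Tanimoto (Jaccard) overlap of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_ec(query_features, annotated, k=3, level=3):
    """Majority vote over the k most similar annotated reactions.

    `annotated` is a list of (feature_set, ec_number) pairs; `level` is the
    number of EC fields kept when voting (3 gives a sub-subclass such as 1.1.1).
    """
    ranked = sorted(annotated,
                    key=lambda record: overlap(query_features, record[0]),
                    reverse=True)
    votes = Counter(".".join(ec.split(".")[:level]) for _, ec in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature tokens attached to real EC numbers, for illustration only.
annotated = [
    ({"C-O:1>2", "C-H:break", "N+ring:reduce"}, "1.1.1.1"),
    ({"C-O:1>2", "C-H:break", "N+ring:reduce", "COOH:adjacent"}, "1.1.1.27"),
    ({"C-N:break", "C=O:form", "PLP:imine"}, "2.6.1.1"),
    ({"C-O:break", "O-H:form"}, "3.1.1.1"),
]
query = {"C-O:1>2", "C-H:break", "N+ring:reduce", "aryl:adjacent"}

sub_subclass = predict_ec(query, annotated)                 # vote among the top 3 hits
closest_full = predict_ec(query, annotated, k=1, level=4)   # EC number of the best hit
```

Voting at the sub-subclass level is more forgiving than requiring a full four-field match, which matters when the query reaction has no exact counterpart among the annotated reactions.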

Discussion

Collecting scattered feature data and supplementing missing information is the direction in which database optimization is headed. Enzyme datasets provide the raw materials needed for insight into biological systems, but their potential can only be realized through high-level analysis. This includes feature exploration of unannotated sequences and of reactions that lack a protein sequence or identity. The textual description of biochemical reactions in UniProt is now being replaced by reactions from Rhea, and the addition of high-quality biochemical information to enzyme annotation is becoming a trend (32).

Conventional databases have, therefore, been expanded beyond sequence and structure data to include reaction centre visualization, reaction rule extraction and core-to-core analysis. Using the current reaction data in EnzyMine, the prediction accuracy of the returned EC numbers can reach 92%, which supports the validity of the chemical information and could be improved further by combining it with enzyme sequence and structure data. Reaction features have the potential to provide more high-quality, low-dimensional features for the feature extraction step of machine learning and are expected to improve the prediction accuracy of function annotation. In future studies, EnzyMine will continue to enrich its data resources in line with data development, and we will apply this high-value chemical information to the further expansion of enzyme functional annotation.

Conclusions

EnzyMine integrates compiled and updated enzyme data and supplements them with reaction chemical features. The reaction feature data in this study provide visualization of the reaction centre, a summary of reaction rules and core-to-core analysis, which contribute to a deeper understanding of enzymatic reactions and will offer researchers a clearer view of biochemical processes and the enzymatic reactions that mediate them. In summary, the overall enzyme feature mining and holistic integration provide advanced and integrated resources for enzyme function annotation.

Funding

National Key Research and Development Programme of China (grant number 2019YFA0904300, 2018YFA0900700, 2017YFC1601702); National Natural Science Foundation of China (grant number 31700081, 31570092); Scientific Research Conditions and Technical Support System Programme (grant number KFJ-BRP-009); the CAS STS programme (grant number QYZDB-SSW-SMC012); International Partnership Programme of Chinese Academy of Sciences of China (grant number 153D31KYSB20170121) and the Natural Science Foundation of Tianjin (15JCYBJC54300).

Conflicts of interest. Authors declare that there were no conflicts of interest.

References

1. Machado, Andrejev, Tramontano et al. (2018) Fast automated reconstruction of genome-scale metabolic models for microbial species and communities. Nucleic Acids Res., 7542–7553.
2. Zhang, Freddolino P.L. and Zhang (2017) COFACTOR: improved protein function prediction by combining structure, sequence and protein-protein interaction information. Nucleic Acids Res., W291–W299.
3. Kumar and Skolnick (2012) EFICAz2.5: application of a high-precision enzyme function predictor to 396 proteomes. Bioinformatics, 2687–2688.
4. Wang, Umarov et al. (2018) DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics, 760–769.
5. Dawson N.L., Lewis T.E., Das et al. (2017) CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res., D289–D295.
6. Sigrist C.J., de Castro, Cerutti et al. (2013) New and continuing developments at PROSITE. Nucleic Acids Res., D344–D347.
7. Ribeiro A.J.M., Holliday G.L., Furnham et al. (2018) Mechanism and Catalytic Site Atlas (M-CSA): a database of enzyme reaction mechanisms and active sites. Nucleic Acids Res., D618–D623.
8. Andreeva, Howorth, Chothia et al. (2014) SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res., D310–D314.
9. Bateman, Martin M.J., O’Donovan et al. (2015) UniProt: a hub for protein information. Nucleic Acids Res., D204–D212.
10. Berman H.M., Westbrook, Feng et al. (2000) The Protein Data Bank. Nucleic Acids Res., 235–242.
11. Watanabe, Murata, Ogawa et al. (2020) Exploration and evaluation of machine learning-based models for predicting enzymatic reactions. J. Chem. Inf. Model., 1833–1843.
12. Dalby A.R., Q.-N., Zhu et al. (2012) Assignment of EC numbers to enzymatic reactions with reaction difference fingerprints. PLoS One, e52901.
13. Yamanishi, Hattori, Kotera et al. (2009) E-zyme: predicting potential EC numbers from the chemical transformation pattern of substrate-product pairs. Bioinformatics, i179–i186.
14. Moriya, Shigemizu, Hattori et al. (2010) PathPred: an enzyme-catalyzed metabolic pathway prediction server. Nucleic Acids Res., W138–W143.
15. McDonald A.G., Boyce and Tipton K.F. (2009) ExplorEnz: the primary source of the IUBMB enzyme list. Nucleic Acids Res., D593–D597.
16. Nagano, Nakayama, Ikeda et al. (2014) EzCatDB: the enzyme reaction database, 2015 update. Nucleic Acids Res., D453–D458.
17. Schomburg, Jeske, Ulbrich et al. (2017) The BRENDA enzyme information system–from a database to an expert system. J. Biotechnol., 261, 194–206.
18. Hatzimanikatis, Ionita J.A. et al. (2005) Exploring the diversity of complex metabolic networks. Bioinformatics, 1603–1609.
19. Jeffryes J.G., Colastani R.L., Elbadawi-Sidhu et al. (2015) MINEs: open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics. J. Cheminform., 44.
20. Hadadi, Hafner, Shajkofci et al. (2016) ATLAS of Biochemistry: a repository of all possible biochemical reactions for synthetic biology and metabolic engineering studies. ACS Synth. Biol., 1155–1166.
21. Delepine, Duigou, Carbonell et al. (2018) RetroPath2.0: a retrosynthesis workflow for metabolic engineers. Metab. Eng., 158–170.
22. Duigou, Du Lac, Carbonell et al. (2019) RetroRules: a database of reaction rules for engineering biology. Nucleic Acids Res., D1229–D1235.
23. Cheng, Sun, Zhang et al. (2020) RxnBLAST: molecular scaffold and reactive chemical environment feature extractor for biochemical reactions. Bioinformatics, 2946–2947.
24. Bajusz, Racz and Heberger (2015) Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminform., 20.
25. Cao, Jiang and Girke (2008) A maximum common substructure-based algorithm for searching and predicting drug-like compounds. Bioinformatics, i366–i374.
26. Tian, Yuan et al. (2020) BCSExplorer: a customized biosynthetic chemical space explorer with multifunctional objective function analysis. Bioinformatics, 1642–1643.
27. Rose A.S. and Hildebrand P.W. (2015) NGL Viewer: a web application for molecular visualization. Nucleic Acids Res., W576–W579.
28. Rahman S.A., Cuesta S.M., Furnham et al. (2014) EC-BLAST: a tool to automatically search and compare enzyme reactions. Nat. Methods, 171–174.
29. Eddy S.R. (2011) Accelerated profile HMM searches. PLoS Comput. Biol., e1002195.
30. Das, Sillitoe, Lee et al. (2015) CATH FunFHMMer web server: protein functional annotations using functional family assignments. Nucleic Acids Res., W148–W153.
31. Zhang, Tian, Yuan et al. (2020) Bio2Rxn: sequence-based enzymatic reaction predictions by a consensus strategy. Bioinformatics, 3600–3601.
32. Morgat, Lombardot, Coudert et al. (2019) Enzyme annotation in UniProtKB using Rhea. Bioinformatics, 1896–1901.