Abstract

The number of biological databases is growing rapidly, but different databases use different identifiers (IDs) to refer to the same biological entity. This inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed MantaID, a data-driven, machine learning–based approach that automates the identification of IDs on a large scale. The MantaID model achieved a prediction accuracy of 99% and correctly and efficiently predicted 100 000 ID entries within 2 min. MantaID supports the discovery and exploitation of IDs from a large number of databases (e.g. up to 542 biological databases). To improve applicability, MantaID is also provided as an easy-to-use, freely available, open-source R package, a user-friendly web application and a set of application programming interfaces. To our knowledge, MantaID is the first tool that enables automatic, quick, accurate and comprehensive identification of large quantities of IDs, and it can therefore serve as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases.

Key points
  • MantaID is a data-driven, machine learning–based method that automatically identifies IDs with high accuracy and efficiency at a large scale.

  • The accuracy of MantaID is confirmed using common statistical metrics.

  • A novel metric method is devised to verify the performance of MantaID.

  • MantaID is implemented as an R package, as well as a web app and application programming interface for easy use.

Introduction

Identifiers (IDs) are used in databases to index and code biological data. As of January 2022, there were 1645 databases and approximately 1700 registered ID nomenclatures (1, 2). IDs are required for simple access to biological data and for facilitating cross-referencing between databases. However, each database has its own representation and its own set of ID numbers for identifying biological components (3–9), which means that IDs from different databases may overlap, that is, the same biological entity may have various IDs (10). For example, a molecule can possess both an Entrez ID (11) and an Ensembl ID (12, 13); Ring Finger Protein 180 is represented by a variety of IDs, including HGNC ID 27752, Entrez ID 285671, Ensembl ID ENSG00000164197, OMIM ID 616015, etc. Different databases also tend to employ distinct naming conventions. IDs in the Ensembl database, for example, begin with the prefix 'ENS'; the fourth character is 'G' for genes, 'T' for transcripts and 'P' for proteins; and the ID ends with a number. NCBI uses pure numbers as Entrez gene IDs, while its RefSeq accessions begin with 'NM' for transcripts, 'NP' for proteins and 'NR' for non-coding RNAs; the UniProt database uses a letter followed by numbers. In contrast, Kyoto Encyclopedia of Genes and Genomes IDs are composed of a capital letter followed by five digits, while the GO database uses a combination of letters, numbers and underscores. In addition, these IDs may be temporary and require modification or replacement when new functions of the molecules are revealed. The exchange of information between multiple databases is typically accomplished via mappings between distinct IDs, which has long been a cause for concern.

Several ID conversion services, such as UniProt Mapping (14), DAVID (15), BioMart (16), TogoID (17) and GeneToList (18), have been developed to address this issue. These ID conversion tools enable ID-to-ID mapping to convert a gene or gene product from one type to another (19). Some also implement special features; TogoID (17), for example, can disambiguate and transform IDs. However, they all require prior knowledge of the database to which an ID belongs and are incapable of identifying IDs in the absence of database names. Therefore, a tool that can automatically construct cross-references between different databases without requiring knowledge of the database names is needed. In this paper, we describe the MantaID tool, which identifies and classifies unknown IDs quickly and precisely by automatically creating ID mappings across multiple databases. This differs from current ID conversion programs, which rely on ID mappings between databases and support only a limited number of ID types. To our knowledge, MantaID is the first tool to identify IDs using machine learning algorithms, which have often been applied in other biological settings such as genomic sequence analysis and the annotation of proteomics or metabolomics data (20).

The computational framework and all the approaches of MantaID are implemented as a software package that handles all the steps of the model development process and makes it easy to create user-defined ID recognition models by adjusting a few parameters. To demonstrate the usability of MantaID, we have also developed a user-friendly web application that exposes the framework's approach and workflow for automated ID recognition and enables users to recognize multiple IDs without delving into the model implementation details. In addition, we provide application programming interface (API) access so that users can launch complex queries programmatically.

Materials and Methods

For easy reference, we summarize the mathematical notations used throughout this paper in Table 1.

Table 1.

Mathematical notations and symbols used in this paper

Parameter | Definition
$D$ | A training data frame with label and feature columns
$N$ | A data frame for forecasting, with feature columns to predict
$K$ | A data frame with feature columns and a prediction column
$s_{\max}$ | The maximum number of brackets in Hyperband tuning
$B$ | The total budget
$n$ | The number of parameter configurations
$r$ | The actual budget for a single hyperparameter configuration
$T$ | A grouping of parameter configurations
$n_i$ | Number of configurations in bracket $i$
$r_i$ | Resource allocation in bracket $i$
$L$ | The validation loss of configuration $t$
$R$ | Maximum number of resources
$\eta$ | The proportion of parameter configurations that 'advances' to the next round in Hyperband tuning
$G_i$ | The Gini index of the $i$th feature
$\alpha_{\mathrm{best}}$ | The feature that minimizes $G_i$
$D_{\mathrm{subs}}$ | Sub-datasets induced from $D$ by splitting on $\alpha_{\mathrm{best}}$
$Z^{*}$ | Bootstrap samples drawn from $D$
$\mathrm{Tree}_b$/$\mathrm{Tree}_t$ | A weak tree learner
$e_b$ | The out-of-bag (OOB) error rate
$F_b$ | A small subset of features
$\mathrm{Forest}$ | A strong learner made up of weak tree learners
$g_{ti}$ | The $i$th node's first derivative in round $t$
$h_{ti}$ | The $i$th node's second derivative in round $t$
$G_t$ | The sum of the first derivatives
$H_t$ | The sum of the second derivatives
$G_{\mathrm{L}}$ | The sum of the left subtree's first derivatives
$H_{\mathrm{R}}$ | The sum of the right subtree's second derivatives
$G_{\mathrm{R}}$ | The sum of the right subtree's first derivatives
$H_{\mathrm{L}}$ | The sum of the left subtree's second derivatives
$\gamma$ | The regularization coefficient governing the complexity (number) of leaf nodes
$\lambda$ | The regularization coefficient on the leaf weights
$O_j$ | The output value of neuron $j$
$w_{ij}$ | The weight matrix between layer $i$ and layer $j$
$\theta_i$ | The bias of the $i$th neuron

MantaID framework

A schematic overview of the MantaID framework can be found in Figure 1A. The MantaID workflow begins with a data frame containing IDs and classes, obtained either by connecting to a public database using the 'mi_get_ID_attr' and 'mi_get_ID' functions or from other sources after preprocessing, such as data frame reshaping and invalid-data removal with the 'mi_clean_data' function. Next, the data frame containing the ID column is passed to the 'mi_get_padlen' and 'mi_split_col' functions, which split each ID into single characters padded to the maximum ID length and return a wide data frame, in the original sample order, containing the positional features and class of each ID. All single-character features are then converted into numeric types using a fixed mapping by calling the 'mi_to_numer' function, after which they can be used directly for training. Prior to training, the 'mi_balance_data' function oversamples and undersamples the data using the Synthetic Minority Oversampling Technique (SMOTE) (21) and random methods, respectively. Thirty per cent of the unbalanced data is used as the test set and the remainder as the training set, both of which are returned as a list. Model tuning is also required: the functions 'mi_tune_rp', 'mi_tune_rg' and 'mi_tune_xgb' use the original dataset to tune the parameter spaces of the classification and regression tree (CART), random forest (RF) and extreme gradient boosting (XGBoost) models, respectively, and then draw the tuning-stage plots and return them along with the tuner. Next, the functions 'mi_train_rp', 'mi_train_rg', 'mi_train_xgb' and 'mi_train_BP' train the CART, RF, XGBoost and back propagation neural network (BPNN) models on the training sets and validate them on the test sets to obtain the trained models and validation results. Finally, confusion matrices (CMs) are calculated and heat maps are plotted using the 'mi_get_confusion' and 'mi_plot_heatmap' functions. A custom wrapper function 'mi' is provided to streamline the steps of the MantaID workflow, which is condensed in the sketch below. In addition to quick large-scale ID identification based on machine learning, MantaID offers a slower but more comprehensive ID recognition method based on online retrieval. This method covers 542 databases, supports thorough small-scale ID recognition tasks and can be used as a complementary method whenever users wish, taking advantage of the up-to-date information available in the remote databases. For practical use, the framework has been implemented as an open-source R package called MantaID, and the steps for constructing a MantaID model for ID identification are described below.
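A condensed run of this workflow might look like the following sketch; the function names are those described above, but the argument names and values are assumptions for illustration, not the package's exact signatures:

```r
# Minimal sketch of the MantaID workflow; argument names are illustrative.
library(MantaID)

attrs <- mi_get_ID_attr(dataset = "hsapiens_gene_ensembl")  # fetch ID attributes
ids   <- mi_get_ID(attrs)                    # long data frame with ID and class
len   <- mi_get_padlen(ids)                  # length of the longest ID
feat  <- mi_split_col(ids, pad_len = len)    # per-position character features
num   <- mi_to_numer(feat)                   # characters -> numeric codes
sets  <- mi_balance_data(num)                # list: training set, test set
fit   <- mi_train_rg(sets[[1]], sets[[2]])   # e.g. the random forest model
cm    <- mi_get_confusion(fit)               # confusion matrix of the validation
mi_plot_heatmap(cm)                          # heat map of the CM
```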

Figure 1.

Schematic overview of the MantaID tool. (A) The theoretical framework of MantaID. (B) The R package functions of MantaID. The wrapper function 'mi()' groups the functionalities of MantaID and can be executed to carry out all the steps of the MantaID workflow in a lazy fashion.

MantaID model

Data acquisition

MantaID searches public databases for ID datasets and downloads them. First, the 'mi_get_ID_attr' function connects to the Ensembl database via the biomaRt package (22); in our test it retrieved 3374 attributes of the human genome–related dataset (23, 24). MantaID can be applied to datasets of other species by modifying the 'dataset' argument of the 'mi_get_ID_attr' function and supports all datasets listed in the R package biomaRt (22). After retrieval, a filter routine based on regular expressions is applied, leaving 68 ID-related attributes. The attribute data frame is then passed to the 'mi_get_ID' function, which retrieves the corresponding datasets from Ensembl and rebuilds them into a long data frame of 2 936 411 rows. Twenty-nine datasets that lack ID information are eliminated by manual inspection. Finally, a data frame with ID and class columns and 2 751 478 rows is generated.
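For orientation, the retrieval that 'mi_get_ID_attr' wraps can be reproduced directly with the standard biomaRt API; the sketch below is illustrative and does not claim to show MantaID's exact internals (the regex filter in particular is a stand-in):

```r
library(biomaRt)

# Connect to the Ensembl human gene dataset used in the test above.
mart  <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
attrs <- listAttributes(mart)                 # 3374 attributes in the authors' test

# Crude regular-expression filter for ID-related attributes (illustrative only).
id_attrs <- attrs[grepl("id", attrs$name), ]

# Retrieve one ID-type column as an example.
hgnc <- getBM(attributes = "hgnc_id", mart = mart)
```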

Data preprocessing

MantaID converts ID data into the format required by machine learning algorithms. The first step is to obtain the length of the longest ID using the 'mi_get_padlen' function. The 'mi_split_col' function then takes this length and the ID data frame as arguments, splits each ID element into a vector of characters, pads each vector to the maximum length and combines them by row, returning a wide data frame containing the positional information of the IDs. The 'mi_to_numer' function then converts the features of the input data frame into computable numeric features by constructing a mapping from characters to numbers.
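The effect of these steps on a single ID can be sketched in base R as follows; the padding character and the fixed character-to-number mapping shown here are assumptions for illustration:

```r
# Split an ID into positional characters, pad to a fixed width, then encode.
id      <- "ENSG00000164197"
pad_len <- 18L                                  # length of the longest ID (example)
chars   <- strsplit(id, "")[[1]]
chars   <- c(chars, rep("*", pad_len - length(chars)))   # pad short IDs
codes   <- as.integer(factor(chars, levels = c("*", 0:9, LETTERS, letters)))
codes   # one row of positional numeric features for training
```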

Data balancing

MantaID balances the minority and majority classes in training datasets. A common method is random sampling, which balances the data by duplicating randomly selected minority-class samples and removing randomly selected majority-class samples. The limitation of random sampling is that excessive sample duplication may compromise the model's capacity to generalize (25). Therefore, the SMOTE technique is used for oversampling, whereas the random method is used for undersampling. The main advantage of using SMOTE is that it avoids the overfitting caused by random oversampling. MantaID balances data with the 'mi_balance_data' function, which takes a data frame of unbalanced data as input and performs data balancing on it. Thirty per cent of the data is used as a test set and the rest as a training set; the results are returned as a list. In addition to balancing the data, feature filtering is necessary to improve model accuracy, because the datasets are typically noisy and contain a large number of irrelevant features.
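A minimal sketch of the oversampling step, using the smotefamily package as one possible SMOTE implementation (the text does not state which implementation MantaID uses internally):

```r
library(smotefamily)

# Oversample the minority classes of a numeric feature table with SMOTE.
# 'features' is a data frame of numeric columns; 'labels' is the class vector.
bal <- SMOTE(X = features, target = labels, K = 5)   # K nearest neighbours
balanced <- bal$data   # original plus synthetic samples, with a class column
```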

Feature filtering

MantaID eliminates irrelevant and redundant features by estimating feature covariance and Gini importance. Since the length of the longest ID determines the number of features in the processed dataset, redundant features introduced by padding are anticipated and need to be screened. Prior to filtering, the 'mi_plot_cor' function computes the Pearson correlation coefficients of the features to generate the covariance matrix and plots a heat map with the coefficient values as the color depth. Next, the 'mi_get_importance' function calculates the Gini-impurity-based importance of each feature, which is presented as a histogram. Finally, low-weighted features are deleted using a threshold method based on covariance and importance, as sketched below. The filtered data are subsequently fed to the machine learning algorithms to generate classification models.
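A sketch of the two filtering criteria; the data frame layout and the quantile threshold are assumptions chosen purely for illustration:

```r
# Pearson correlations between positional features (input to the heat map).
cors <- cor(train[, setdiff(names(train), "class")], method = "pearson")

# Gini-impurity importance from a random forest (via ranger).
library(ranger)
rf  <- ranger(class ~ ., data = train, importance = "impurity")
imp <- rf$variable.importance

# Drop low-importance features; the 25% quantile threshold is illustrative.
keep <- names(imp)[imp > quantile(imp, 0.25)]
train_filtered <- train[, c(keep, "class")]
```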

Model selection

MantaID contains four machine learning models for the large-scale and automatic identification of IDs: CART, RF, XGBoost and BPNN.

CART (26, 27) uses a tree structure to classify samples into different categories based on the distribution of their features in specific dimensions. All features and possible split points in the training set are traversed to find the best splitting feature and best split point. The training dataset is then divided into two subsets using this feature and split point, which define the left and right subtrees, and the search is repeated for each subtree. The best splitting feature and split point of each leaf node are determined recursively, allowing each leaf node to be partitioned into left and right subtrees. The pseudocode of the implemented algorithm in MantaID is given in Algorithm 1 (see the Supplementary File).
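In MantaID this model is built on the rpart package (26, 27); a sketch using the tuned values reported in Table 3 might look like the following, where the mapping of table rows to rpart arguments is an assumption:

```r
library(rpart)

# CART with the tuned hyperparameters reported in Table 3.
cart <- rpart(class ~ ., data = train, method = "class",
              control = rpart.control(cp = 0.00053, maxdepth = 24,
                                      minsplit = 4, xval = 0, maxcompete = 3))
pred <- predict(cart, test, type = "class")
```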

RF (28, 29) is based on bootstrapping using a small set of features to generate a large number of decision trees, which are then used to classify new data with greater accuracy than a single decision tree. The pseudocode of the RF algorithm is presented in Algorithm 2 (see the Supplementary File).
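A corresponding sketch with ranger (28), again using Table 3's values with assumed argument mappings:

```r
library(ranger)

# Random forest with 385 trees, Gini splitting and impurity-based importance.
rf <- ranger(class ~ ., data = train, num.trees = 385,
             splitrule = "gini", importance = "impurity",
             max.depth = 36, oob.error = TRUE)
rf$prediction.error   # out-of-bag error rate (e_b in Table 1)
```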

Based on the gradient boosting decision tree (30), XGBoost (31, 32) is an optimized distributed gradient boosting library that can massively parallelize boosting-tree construction. Its main strength lies in continuously adding trees through feature splitting: each new tree learns a new function that fits the residuals of the previous prediction. When training is complete, there are k trees; a sample is mapped to one leaf node per tree according to its features, and the scores of these leaf nodes are summed to give the sample's prediction value. A detailed pseudocode is presented in Algorithm 3 (see the Supplementary File).
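A sketch of the XGBoost configuration (31) with the tuned values from Table 3; the parameter-name mapping is an assumption:

```r
library(xgboost)

# Multiclass boosting with the tuned values from Table 3.
dtrain <- xgb.DMatrix(data = as.matrix(features), label = as.integer(labels) - 1)
bst <- xgb.train(params = list(booster = "gbtree", eta = 0.29,
                               subsample = 0.84, colsample_bytree = 0.99,
                               max_depth = 88, gamma = 0,
                               objective = "multi:softprob",
                               num_class = nlevels(labels)),
                 data = dtrain, nrounds = 10)   # 'number of passes' = 10
```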

The learning process of the BPNN is divided into two stages (33, 34): forward signal propagation and backward error propagation. When the actual output of the output layer does not match the desired output during forward propagation, the error advances to the backward propagation stage, which obtains the error signal of each unit as the basis for correcting that unit's weights. The pseudocode of this process is shown in Algorithm 4 (see the Supplementary File).

Model tuning

MantaID uses the Hyperband approach to tune hyperparameters for CART, RF and XGBoost before training. Hyperband (35), an extension of Successive Halving (36), is used to determine optimized settings of the operational parameters. For each set of parameter combinations, the loss value is computed using the R package 'mlr3hyperband' (37). After the loss of each parameter combination is evaluated, only the third of combinations with the lowest loss values is advanced to the next iteration. This process is summarized in pseudocode form in Algorithm 5 (see the Supplementary File) and sketched below.
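A minimal sketch of Hyperband tuning with mlr3hyperband (37), assuming the current mlr3tuning interface; the search space below is illustrative, not the paper's actual tuning space:

```r
library(mlr3verse)
library(mlr3hyperband)

# XGBoost learner whose boosting rounds serve as the Hyperband budget parameter.
learner <- lrn("classif.xgboost",
               nrounds = to_tune(p_int(9, 243, tags = "budget")),
               eta     = to_tune(0.01, 0.3))

instance <- tune(tuner      = tnr("hyperband", eta = 3),  # advance the best 1/3
                 task       = tsk("iris"),                # placeholder task
                 learner    = learner,
                 resampling = rsmp("holdout"),
                 measures   = msr("classif.ce"))
instance$result   # best hyperparameter configuration found
```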

Parameter configurations for the BPNN are tuned using a different approach, as follows. The BPNN consists of a four-layer fully connected network with an input layer, two hidden layers and an output layer. First, the numbers of nodes in the input and output layers are set equal to the numbers of features and categories, respectively, while the number of nodes in each hidden layer is fixed at 40 according to previously described rules of thumb (38). Next, the Rectified Linear Unit (ReLU) is used as the activation function for the hidden layers instead of sigmoid or tanh, because it is less computationally intensive and does not tend to saturate, while Softmax is used for the output layer. Finally, the Adam optimizer (39) is used to compute individual adaptive learning rates for different parameters, circumventing the need for hyperparameter tuning. This process is described in Algorithm 4 (see the Supplementary File).
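The described network can be sketched with the keras R interface; the text does not state which framework 'mi_train_BP' uses, so this is an illustration whose layer sizes simply follow the description above:

```r
library(keras)

# Four-layer fully connected network: input, two 40-unit ReLU hidden layers,
# and a softmax output; Adam supplies per-parameter adaptive learning rates.
# n_features and n_classes are placeholders for the data's dimensions.
model <- keras_model_sequential() %>%
  layer_dense(units = 40, activation = "relu", input_shape = n_features) %>%
  layer_dense(units = 40, activation = "relu") %>%
  layer_dense(units = n_classes, activation = "softmax")

model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")
```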

Model training

Balanced datasets are used for training. To begin, the training and test sets are accepted as parameters by the functions 'mi_train_rp', 'mi_train_rg' and 'mi_train_xgb', which train and validate the CART, RF and XGBoost models. After the CMs of the validation results are calculated and plotted as heat maps, the trained models are returned as a list. For the BPNN, the 'mi_train_BP' function sets the number of epochs to 64 and the batch size to 128, based on empirical guidelines in the literature (38), and likewise accepts the training and test sets as inputs. After training is complete, the CM is returned and plotted as a heat map.
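Continuing the keras sketch above, training with the stated epoch and batch-size settings would look like this (x_train, y_train, x_test and y_test are placeholders):

```r
# Train for 64 epochs with batches of 128 samples, validating on the test set.
history <- model %>% fit(x_train, y_train,
                         epochs = 64, batch_size = 128,
                         validation_data = list(x_test, y_test))
```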

Model unification and scoring

Because an even number of models is used, the final result cannot be derived directly by majority voting. To resolve this issue, we present a new method for aggregating models, depicted in Figure 2 and described as follows. MantaID uses the voting method when there is a majority class among the prediction results; when opinions are scattered, MantaID instead evaluates each candidate result with the following scoring formula:

Figure 2.

New metrics for aggregating MantaID models. To incorporate the available information, we multiply the model's F1 score by the misclassification rates of the other models to calculate the submodel's score. When the submodels disagree, we assign a score to each result and select the best one.

$$N_{\mathrm{score}} = F1_x^N \prod_{R \in \neg N,\; y = \mathrm{val}(R)} \left( c + (1 - c)\, P(y \mid x)^R \right)$$
(1)
$$\frac{\partial\, \mathrm{Score}_N}{\partial P(y \mid x)^{\neg N}} = 1 - c$$
(2)

where $N_{\mathrm{score}}$ is the score of model $N$, $F1_x^N$ is the F1 score of model $N$ for category $x$, $\mathrm{val}(R)$ is the prediction result of model $R$, $P(y \mid x)^R$ is the probability that model $R$ misclassifies $y$ as $x$ and $c$ is a constant that determines the degree of influence of the other models on the score of the current model. The larger the value of $c$, the smaller the partial derivative $\partial \mathrm{Score}_N / \partial P(y \mid x)^{\neg N}$ and hence the smaller the effect, according to Equation (2).
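A direct transcription of Equation (1) into R, for one candidate model N and one predicted class x; all inputs are placeholders:

```r
# Score of model N per Equation (1): its F1 for class x, discounted by the
# misclassification probabilities P(y|x)^R of the disagreeing models R.
score_model <- function(f1_x, p_mis, c = 0.5) {
  f1_x * prod(c + (1 - c) * p_mis)
}

# Example: F1 of 0.95 and two other models with misclassification probs 0.1, 0.3.
score_model(0.95, c(0.1, 0.3), c = 0.5)
```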

Although accuracy is a good indicator of the model's correct prediction rate for random individuals, it works poorly on unbalanced datasets and tends to hide serious classification errors for classes with few samples (40). This problem can be avoided by using the F1 score, which balances precision and recall and reflects the model's effectiveness in classifying a given class (41); this evaluation criterion in MantaID is therefore based on the F1 score. In addition, to fully utilize the existing information, the misclassification rates of the other models are included when computing a model's score, in order to avoid bias in the evaluation. Finally, the model with the highest score ($N_{\mathrm{score}}$) is selected and is then evaluated by recall, precision, accuracy and the F1 score. For convenience, we use the following abbreviations: TP, true positive; FP, false positive; TN, true negative; FN, false negative; Acc, accuracy; Pre, precision; Rec, recall; and F1, F1 score.

$$\mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP} + \mathrm{TN}}$$
(3)
$$\mathrm{Pre} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
(4)
$$\mathrm{Rec} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
(5)
$$\mathrm{F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FN} + \mathrm{FP}}$$
(6)
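For reference, Equations (3)–(6) computed from confusion-matrix counts in R:

```r
# Per-class metrics from one-vs-rest confusion-matrix counts.
class_metrics <- function(tp, fp, tn, fn) {
  c(Acc = (tp + tn) / (tp + fp + tn + fn),
    Pre = tp / (tp + fp),
    Rec = tp / (tp + fn),
    F1  = 2 * tp / (2 * tp + fn + fp))
}
```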

MantaID web application

MantaID includes a user-friendly web application for ID identification, freely available at https://molaison.shinyapps.io/MantaID/. The primary MantaID interface features a search box in which users can input a query and run the ID identification methods available in MantaID. A more comprehensive, crawler-based algorithm is also adopted by the MantaID web application to improve the accuracy of ID identification. First, MantaID performs pattern matching with regular expressions obtained from identifiers.org, hosted by the European Bioinformatics Institute (42), to filter out missing or malformed data. Second, MantaID connects to the Uniform Resource Locators (URLs) of the IDs using the 'httr' R package (43). An ID is judged non-existent or inaccessible when the connection yields an error Hypertext Transfer Protocol (HTTP) status code, such as the 404 page-not-found error. Finally, MantaID retrieves and analyzes the text of the database webpages to determine whether an ID does not exist, based on the presence of contextual keywords such as 'failure' or 'No correct information'. Together, these steps determine whether an ID exists and which databases it belongs to, excluding invalid IDs.
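A sketch of the URL-based existence check described above, using the httr API; the helper name and keyword list are illustrative:

```r
library(httr)

# Returns FALSE when the ID's URL errors out (e.g. HTTP 404) or when the
# page carries keywords indicating a missing record.
id_exists <- function(url, timeout_sec = 5) {
  resp <- tryCatch(GET(url, timeout(timeout_sec)), error = function(e) NULL)
  if (is.null(resp) || http_error(resp)) return(FALSE)
  page <- content(resp, as = "text", encoding = "UTF-8")
  !grepl("failure|No correct information", page, ignore.case = TRUE)
}
```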

To assist new users, example queries and guidelines are provided alongside the search box. As the identification process progresses, each successfully matched database name, together with pertinent information, is returned as a row in the result table displayed beneath the search box, and the table can be saved and exported in various file formats. The original names retrieved from the databases are augmented with modifiers and shown in the same 'name' column to distinguish identical entities within databases, enabling an ID query to identify all matched biological entities, such as a gene, protein or transcript (44, 45).

Advanced search options are also provided: (i) the user can specify the maximum time for accessing each entry, (ii) the user can select whether to go directly to the associated database using the provided URL, (iii) the user can specify the type of object indicated by the ID and (iv) the user can select between local (intensified) and global (diversified) search strategies. A batch search tool is supplied to apply the described MantaID methodology to large files of unidentified ID data. The batch search results can be formatted and aligned, and the data can be exported for download in a variety of user-specified formats, as well as for reproducing the model predictions.

Results

Performance evaluation of the MantaID model

We evaluated MantaID on datasets assembled from public databases to demonstrate its ability to identify IDs. MantaID was executed to construct an ID identification model using 39 datasets (Table 2). After the data processing steps were completed, the correlation heat map and importance histogram were generated from the feature covariance matrix and the feature selection results. As shown in Figure 3, the last 10 features have low importance and low relevance to the target class, which supports our hypothesis that this redundancy is caused by padding the IDs; these features were therefore regarded as redundant and deactivated.

Table 2.

Databases and datasets currently available in the MantaID model

Name | Imbalanced | Balanced | Description
The Consensus CDS | 32 717 | 60 736 | CCDS ID
Conserved Domain Database | 7204 | 68 390 | CDD ID
ChEMBL | 4030 | 69 342 | ChEMBL ID
EMBL | 199 350 | 139 545 | European Nucleotide Archive ID
Ensembl exon | 852 763 | 596 934 | Exon stable ID
Ensembl gene | 68 016 | 50 146 | Gene stable ID
Entrez Gene Database | 22 927 | 63 673 | NCBI gene (formerly Entrezgene) ID
HAMAP | 358 | 70 444 | HAMAP ID
HGNC | 39 780 | 58 617 | HGNC ID
HGNC Transcript | 232 496 | 162 747 | Transcript name ID
PANTHER | 23 775 | 63 418 | PANTHER ID
Interpro | 17 612 | 65 267 | Interpro ID
Merops | 780 | 70 317 | MEROPS (the Peptidase Database) ID
miRBase | 1846 | 69 997 | miRBase ID
Protein Data Bank | 48 239 | 56 079 | PDB ID
Pfam | 6595 | 68 572 | Pfam ID
pfScan | 895 | 70 282 | PROSITE profiles ID
PIRSF | 949 | 70 266 | PIRSF ID
PRINTS | 1483 | 70 106 | Prints ID
Protein | 490 333 | 343 233 | INSDC protein ID
Reactome | 2495 | 69 802 | Reactome gene ID
Refseq mrna | 62 046 | 51 937 | RefSeq mRNA ID
Refseq ncrna | 15 828 | 65 803 | RefSeq ncRNA ID
Refseq peptide | 57 215 | 53 386 | RefSeq peptide ID
Rfam | 58 | 70 534 | RFAM ID
Rfam transcript | 1461 | 70 113 | RFAM transcript name ID
RNAcentral | 89 729 | 62 810 | RNAcentral ID
ScanProsite | 881 | 70 287 | PROSITE patterns ID
Structure–Function Linkage Database | 64 | 70 532 | SFLD ID
SMART | 1020 | 70 245 | SMART ID
SUPERFAMILY | 1113 | 70 217 | Superfamily ID
TIGRFAMs | 594 | 70 373 | TIGRFAM ID
UCSC | 226 788 | 158 752 | UCSC Stable ID
UniProt Archive | 90 791 | 63 554 | UniParc ID
Uniprot gene | 20 438 | 64 420 | UniProtKB Gene Name symbol
Uniprot isoform | 24 825 | 63 104 | UniProtKB isoform ID
Uniprot TrEMBL | 61 771 | 52 020 | UniProtKB/TrEMBL ID
Uniprot Swiss-prot | 19 287 | 64 765 | UniProtKB/Swiss-Prot ID
WikiGene | 22 926 | 63 673 | WikiGene name

The 'Imbalanced' and 'Balanced' columns give the number of ID entries before and after data balancing, respectively.
Figure 3.

Validation of the MantaID model performance. (A and B) The results of feature selection. (A) Correlation heat map. Positive values mean positive correlation and negative values mean negative correlation, as evaluated by Pearson's correlation test. (B) Feature importance computed by RF. The horizontal coordinate is the Gini impurity, an indicator for evaluating importance, and the vertical coordinate is the feature. (C–E) Stage plots for Hyperband tuning of (C) CART, (D) RF and (E) XGBoost. Each line or point represents a set of related parameters, and the Hyperband algorithm discards all but a fraction $1/\eta$ of the configurations at each stage to cut training time. Notably, the CART model's line appears to stagnate because the accuracy change between stages is minimal compared to the plotted span.

Then, the ratio of the largest majority class to the smallest minority class was used to measure the degree of imbalance. According to Table 2, this ratio is about 14 702:1 for the original dataset, indicating that the data are extremely imbalanced. After the data balancing steps, the ratio is reduced to approximately 12:1, suggesting that the imbalance is greatly reduced. After balancing the data, the CART, RF and XGBoost models were tuned using the Hyperband method with $\eta$ set to 3, so that only one-third of the candidate hyperparameter combinations survived each of the four stages. In total, 49 parameter combinations were evaluated across all stages of the parameter spaces of the three models, as shown in Table 3. The results of the parameter tuning for all stages are shown in Figure 3. For each model, the parameter combination with the lowest loss value in the fourth stage was regarded as the most robust and was chosen.

Table 3.

Parameter configuration for CART, RF, XGBoost and BPNN

Model | Parameter | Value
CART | Complexity parameter | 0.00053
CART | Maximum depth of tree | 24
CART | Minimum observations in a node | 4
CART | Number of cross-validations | 0
CART | Number of competitor splits retained | 3
RF | Maximum depth of tree | 36
RF | Number of decision trees | 385
RF | Splitting criterion | 'gini'
RF | Minprop | 0.017
RF | Evaluation with the out-of-bag sample | TRUE
RF | Importance | 'impurity'
XGBoost | Maximum depth of tree | 88
XGBoost | Eta (learning rate) | 0.29
XGBoost | Regularization factor | 0.014
XGBoost | Proportion of random sampling | 0.84
XGBoost | Iterative model | 'gbtree'
XGBoost | Minimum loss function descent value | 0
XGBoost | Regularization term of weight | 0.92
XGBoost | Number of passes | 10
XGBoost | Column sampling | 0.99
BPNN | Iterations (epochs) | 64
BPNN | Proportion of the training set used as the test set | 0.3
BPNN | Loss function | 'categorical_crossentropy'
BPNN | Batch size | 128
BPNN | Optimizer | 'Adam'

The MantaID model uses Hyperband to tune the parameters of the first three algorithms.

Next, the effect of balancing was assessed by training the models with their optimal parameter sets on both the balanced and unbalanced training datasets. The assessment results are presented as heat maps of the CMs (Figure 4). The diagonal numbers in the CMs were used to compare models trained on balanced and unbalanced datasets, because a change in the model's specificity is a better outcome measure for assessing results on minority classes than overall accuracy, for both kinds of datasets. Our results show that, before balancing, CART and RF misclassified nearly all minority classes, XGBoost misclassified only a few minority classes and the BPNN correctly classified almost all of them. After balancing, all four models classified the minority classes almost perfectly, indicating that MantaID effectively constructs a robust classifier when learning from a large quantity of unbalanced ID data.

Figure 4.

Heat maps of the CMs for the CART, RF, XGBoost and BPNN models, each trained on both balanced and unbalanced data. The value in each box is the number of truth–prediction pairs. The more accurate the model, the more the values concentrate on the diagonal. Comparing the models with and without sample balancing, we found that although overall accuracy did not noticeably improve as a result of balancing, the balanced models performed better on minority classes.

Finally, the performances of the models were compared using accuracy, precision, recall and F1 scores, as summarized in Table 4. The high recall rates for most ID classes give confidence in the accuracy of the classifications. However, low precision values were obtained for some minority classes, because a small portion of the large majority classes was incorrectly classified into minority classes. Most models failed to accurately predict WikiGene IDs, because WikiGene (46) unites multiple data sources, such as UniProt and Entrez, that contain overlapping information. What stands out in Table 4 is that the results of the integrated model were superior to those of the individual models in almost every category, indicating that the integrated model inherits the advantages of the individual models.

Table 4.

Accuracy, precision, recall and F1 score

[Per-class metric table provided as an image in the original publication.]

The performance of the CART, RF, XGBoost and BPNN models was evaluated based on the scores presented in the table. The scores were used to integrate the outcomes of the models applied to the balanced data. Higher F1 score values reflect better performance. Accuracy, precision, recall and F1 score are abbreviated in the table as Acc, Pre, Rec and F1-sco, respectively.

Features of MantaID web application

MantaID functions can be used directly via MantaID's Shiny application in an easy and reliable way. MantaID contains three main modules (Figure 5): (i) a general search engine; (ii) a more advanced search engine, named the batch search tool and (iii) a fully documented API.

Figure 5.

The features of the MantaID web application. The settings panels allow users to configure basic and advanced settings; basic settings populate the panels by default, whereas advanced settings enable more granular control.

A Google-like search engine allows users to make queries on IDs easily and reliably. ID identification can be carried out across all existing biological databases listed on identifiers.org using the default settings, or customized with advanced options to run crawler-based, personalized algorithms when the user has partial knowledge or imperfect information about the sources of the unknown IDs.

The batch search tool of the MantaID Shiny app provides a five-step template to facilitate large-scale ID identification, along with guidelines and extensive help for customizing the parameters of the MantaID model to pursue better identification efficiency. All results can be aggregated into a single displayed table and can be exported in various formats for ease of analysis.

The API is provided for interfacing with other applications and tools and allows the services provided by MantaID to be integrated into other workflows. This paves the way for other applications to integrate ID identification into their data processing pipelines.

The advantages of using the MantaID Shiny app are manifold: (i) it is cost-free, platform-independent, user-friendly and available to any internet-connected user; (ii) it can perform all the methodologies and methods available in MantaID and (iii) user interactions can be restricted to prevent undesired modifications.

Discussion

In this work, MantaID was developed based on machine learning approaches to conduct large-scale identification of unknown and heterogeneous IDs. Besides achieving a good level of accuracy, MantaID can predict thousands of IDs in a few seconds; in our test, 100 000 IDs generated by randomly sampling the available ID datasets were identified in 71 s.

Previous studies created ID mappings by formulating knowledge-based rules derived from their understanding of the mappings provided by selected databases (14, 15, 17, 18, 47, 48). These tools rely on metadata and annotations provided by databases to link IDs across databases (49). Common database IDs, such as Ensembl (13) and RefSeq (50), serve as bridges between databases that lack direct links for the same entity. These linkages are used as ID mappings that must be updated frequently; the recently published tool TogoID (17), for example, depends on manual curation and is updated every 2 weeks. Lack of frequent updating can result in query failures for new IDs. Moreover, coverage is limited: UniProt (14) supports conversion for 98 databases, DAVID (15) for only 41 and TogoID (17) for only 48. In contrast, MantaID employs a series of machine learning models trained on a large number of database IDs; once the MantaID model derives the rules of ID-to-database mapping automatically from the training datasets, it uses this automatically generated mapping to perform ID interpretation. Therefore, MantaID does not require human intervention for updates. In addition, IDs are not unique across databases, and there is no universal agreement on the composition of a database ID; that is, an artificial, fictitious ID created for testing purposes could pass as a real ID in some databases. Tools in the literature (15, 17, 18) have rather limited ID conversion capabilities that depend primarily on ID and database mappings created from annotations. The ID mappings in these tools are fixed and can only be modified by the tools' maintainers, necessitating stringent ID validation prior to ID conversion (49). Consequently, these tools can only accept IDs specified within their ID-to-database mapping tables. In contrast, MantaID is a machine learning–based tool that can interpolate and impute any IDs supplied by users based on principles derived from probabilistic models. MantaID aims to identify the IDs of all existing biological databases; because MantaID models are built and trained on a vast amount of data from a variety of databases, it is possible to find a legitimate use for an ID that was previously thought to be fictitious. We believe that the MantaID approach is better suited to dealing with a growing number of databases, as it generates ID-to-database mappings automatically without the need for human annotation or intervention.

MantaID is a novel hybrid approach that combines machine learning algorithms with the expressive power of regular expressions to capture the variability encountered during ID matching. Regular expressions are a general-purpose string-matching technique and can only expedite the identification of IDs when ID names are constructed according to carefully and precisely defined rules. However, there are no standard rules for constructing ID names, and ID names can be similar across databases; in most cases, in our experience, the same regular expression matches IDs from multiple databases. For example, on https://identifiers.org/ (42), the regular expression pattern '^[A-Z0-9]+$' defined for the Catalogue of Somatic Mutations in Cancer Gene, Bacterial Tyrosine Kinase and DEPhOsphorylation databases also matches ChEMBL database IDs (whose own regex is '^CHEMBL\d+$'). This inefficiency of regular expressions has been encountered and noted in the literature (51, 52). In addition, overly complex regular expressions formulated for ID identification can exhibit catastrophic backtracking, consuming the majority of the computer's computing power (53–55). Regular expressions alone are therefore not sufficient for identifying IDs that are formulated inconsistently or erratically across many databases. Accordingly, beyond using regular expressions for a global, coarse-grained identification of IDs, MantaID employs machine learning approaches to identify IDs with high efficiency and effectiveness. MantaID generates data-driven, recursive models that can be automatically trained and improved by adding more datasets.
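The ambiguity is easy to reproduce; in the example below, both the generic pattern and the ChEMBL-specific pattern accept the same string:

```r
# A valid ChEMBL ID also matches the generic pattern shared by several other
# databases on identifiers.org, so the pattern alone cannot tell which
# database the ID belongs to.
grepl("^[A-Z0-9]+$",  "CHEMBL25")   # TRUE: generic pattern
grepl("^CHEMBL\\d+$", "CHEMBL25")   # TRUE: ChEMBL-specific pattern
```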

MantaID can identify IDs without requiring explicit knowledge of database names, which, to our knowledge, is a functionality that no other tool provides (56, 57). This functionality is expected to facilitate the automation of data-driven analysis pipelines that translate unsorted free-text words extracted from research papers, containing IDs from different fields, into biologically relevant information via databases. For example, constructing a genome-scale metabolic model involves merging and processing various omics data, which are managed by different databases using different IDs (58), into a structured and unified model; this has always required human knowledge of the databases from which the IDs originate in order to translate and search the IDs, owing to the lack of software tools capable of automatically identifying an ID's database (59, 60). With the help of MantaID, large amounts of free text from the literature can be fed in to search for their meanings in databases, and from the organized ID meaning tables, protein interactions, gene–disease associations, etc. can be constructed (61–64).

Conclusion

In summary, MantaID is capable of identifying IDs rapidly and is based on various machine learning approaches tailored for high accuracy and efficiency. Owing to the data-driven nature of the proposed framework, MantaID supports the identification of all types of IDs across diverse databases, thereby avoiding the limitations encountered by other ID conversion programs. By eliminating the need to manually look up biological IDs in online databases, MantaID is envisioned to become an indispensable tool for the creation of large-scale models that assimilate and integrate large quantities of ID data linking all biological knowledge.

Supplementary Material

Supplementary Material is available at Database online.

Data availability

The source files and API instructions are contained within the R package (https://bitbucket.org/Molaison/mantaid/src/main/).

Conflict of interest

None declared.

Funding

The Fundamental Research Funds for the Central Universities (Hunan University, No. 531118010599).

Author notes

These authors contributed equally to this work.

Z.Z. is a BSc student in the experimental science class at Hunan University. J.H. is an MSc student in the College of Biology at Hunan University. M.C. is an MSc student in the College of Biology at Hunan University. B.L. is a BSc student in the College of Biology at Hunan University. X.W. is an MSc student in the College of Biology at Hunan University. F.Y. is a professor in the College of Biology, Hunan University. His research interests include (i) receptor-like kinase, (ii) RNA metabolism in the environment and (iii) root plasticity. L.M. is an associate professor in the Department of Pharmacy in the College of Biology at Hunan University. His research interests include (i) using mathematical modeling and big data analysis approaches to solve open biology questions and (ii) developing optimization-based, numerical analysis algorithms and bioinformatics tools for genome-scale metabolic modeling.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.