Abstract

In this research, we explored various state-of-the-art biomedical-specific pre-trained Bidirectional Encoder Representations from Transformers (BERT) models for the National Library of Medicine - Chemistry (NLM-CHEM) and LitCovid tracks in the BioCreative VII Challenge and propose a BERT-based ensemble learning approach that integrates the advantages of the various models to improve the system's performance. The experimental results on the NLM-CHEM track demonstrate that our method achieves remarkable performance, with F1-scores of 85% and 91.8% in the strict and approximate evaluations, respectively. Moreover, the proposed Medical Subject Headings identifier (MeSH ID) normalization algorithm is effective for entity normalization, achieving an F1-score of about 80% in both the strict and approximate evaluations. For the LitCovid track, the proposed method is also effective in detecting topics in the Coronavirus disease 2019 (COVID-19) literature, outperforming the compared methods and achieving state-of-the-art performance on the LitCovid corpus.

Database URL: https://www.ncbi.nlm.nih.gov/research/coronavirus/.

Introduction

Artificial intelligence (AI), one of the fastest-growing technologies in research, has garnered substantial investment in recent years. According to the 'Artificial Intelligence Index Report 2021' (1), medical fields have received more than USD 13.8 billion in private AI investment, 4.5 times more than in 2019. In particular, COVID-19 has shaped AI development, for example through the adoption of machine learning techniques to accelerate COVID-related drug discovery. Furthermore, a vast amount of medical textual data exists in the public domain, such as on social media, in online forums and in published articles, including patients' clinical notes and biological publications (2). These text-based data are growing rapidly and can offer valuable insights with the help of text mining (3). However, most text data exist as low-quality, unstructured data. For this reason, Natural Language Processing (NLP) serves as a bridge between human language and computers, enabling machines to understand, process and analyze human language (4). NLP's importance as a tool for comprehending human-generated text follows from the context-dependency of such data: data becomes more meaningful when its context is better understood, which makes text analysis and mining easier (5). NLP methods therefore aid in examining large amounts of unstructured, low-quality text and discovering relevant insights from it (6), and they are frequently utilized for this purpose.

Taking biomedical research as an example, the number of publications in electronic format that can be accessed online is growing rapidly as a result of the swift advancement of technology. For instance, PubMed contains more than 33 million articles and is growing by more than 1000 articles per day (7). With such a rapid explosion of new information, it is impossible for readers to keep up to date with all the relevant research. As a result, automatic knowledge mining and distillation techniques for the biomedical literature have become more prevalent, and biomedical literature mining (BLM), which uses NLP and/or text mining techniques, has consequently gained prominence. In view of the importance of BLM and the lack of common standards or shared evaluation criteria to enable comparison among different approaches, the BioCreative (http://www.biocreative.org/) (Critical Assessment of Information Extraction in Biology) organization was established. This organization hosts an annual Challenge to evaluate text mining and information extraction systems applied to the biological and biochemical domains. The Challenge and the accompanying BioCreative Workshops promote interactions between the text mining and biomedical communities and facilitate the development of new applications as well as improvements to existing text mining systems that satisfy key research needs. The results have been novel applications that can assist in the knowledge discovery process.

To efficiently extract knowledge from the biomedical literature, we can perform two fundamental tasks: (i) biomedical named entity recognition (BioNER) and normalization and (ii) biomedical literature classification; both are discussed in detail in the following sections. In this paper, we present a Bidirectional Encoder Representations from Transformers (BERT)-based ensemble learning approach for Track 2 and Track 5 of the BioCreative VII Challenge. For Track 2 (NLM-CHEM track: full-text chemical identification and indexing in PubMed articles), we integrate different BERT models through ensemble learning to recognize chemical entities. We also tackle the entity linking problem in chemical normalization using an edit distance algorithm computed via dynamic programming. Our framework was shown to surpass the benchmarks as well as the median of the compared methods. Furthermore, for Track 5 (LitCovid track: multi-label topic classification for COVID-19 literature annotation), the evaluation results show that our BERT-based ensemble learning method is effective in detecting topics in the COVID-19 literature.

The remainder of this paper is organized as follows. The next section reviews related works. The Methodology section introduces the structure of the proposed framework, and its performance is evaluated in the Experiments section. Finally, the conclusions of this research are provided in the Concluding remarks.

Related works

The abovementioned BioNER task aims to recognize biomedical entity boundaries and predict entity types, such as genes, proteins, compounds, drugs, mutations and diseases, in the biomedical literature. However, accurate identification and classification face major challenges stemming from the characteristics of biomedical nomenclature: a lack of standardized naming conventions, frequent crossover in vocabulary, excessive use of abbreviations, synonyms and variants, and complex morphology (arising from unusual characters such as Greek letters, digits and punctuation), among others. Moreover, the biomedical domain is a rapidly evolving field in which new concepts and names are coined regularly. As biomedical concepts are investigated in different disciplines of medicine with distinct naming conventions, new variations are continually produced for existing concepts (8). These new names and concepts make it difficult to extract, classify and comprehend the various formats of terms and often result in the misrecognition of relevant biological entities. Compared with other proper names in generic texts, Biomedical Named Entities (BNEs) thus pose a greater challenge for existing computer systems.

In response, recent works have adopted advanced NLP technology for BioNER. For instance, Corbett et al. (9) presented word-level and character-level Bidirectional Long Short-Term Memory (BiLSTM) networks for chemical named entity recognition (NER) in the patent literature domain. Hong et al. (10) created a deep learning (DL) architecture, DTranNER, a conditional random field (CRF)-based framework that incorporates a DL-based label-label transition model into BioNER, where DL is used to learn the label-label transition relations in an input sequence while considering the context. DTranNER possesses two distinct DL-based networks, the Unary-Network and the Pairwise-Network: the former is dedicated to individual labeling, while the latter determines the acceptability of label transitions. These networks in turn feed the CRF layer of the DL framework. Other models that combine word-level and character-level representations have also been utilized. These approaches combine word embeddings with LSTMs (or BiLSTMs) over a word's characters, pass the representation through another sentence-level BiLSTM, and finally predict the tags using either a softmax or a CRF layer.

For biomedical literature classification models, there are two typical categories: bio-entity relation extraction and relevant topic recognition. Over the previous decade, bio-entity relation extraction in the field of Biomedical Natural Language Processing (BioNLP) gained prominence owing to the usefulness of identifying key inter-component relationships when summarizing essential knowledge. For instance, protein-protein interaction (PPI) is an important topic in molecular biology because of the growing demand for automatic discovery of molecular pathways and interactions from the literature. By identifying a protein's participation in the PPI network or comparing it to proteins with similar functionality, it is possible to anticipate the function of uncharacterized proteins. Creating networks of molecular interactions is useful for finding functional modules and discovering new gene-disease correlations. Chang et al. (11) proposed a method that integrates linguistic patterns into a parse tree structure for the support vector machine (SVM) convolution tree kernel to enhance the performance of PPI identification. Another crucial arena in healthcare and other biomedical research is the extraction of chemical-disease relations (CDRs) (12).

To encourage exploration, a pioneering challenge on automatically distilling CDRs from the scientific literature was put forward by the BioCreative V organizers (13, 14). This challenge involved identifying chemical-induced disease (CID) linkages from PubMed articles. Prominent methods, such as an LSTM network model in conjunction with an SVM model, were proposed by Zhou et al. (15). More specifically, the LSTM was employed to represent long-range semantic relations, while syntactic information was modeled by the SVM. A Convolutional Neural Network (CNN) was also proposed by Gu et al. (16) to tackle the CID problem by building a more robust relation representation based on both sequential word order and non-linear dependency paths, which can naturally reflect the relationships between chemical and disease categories.

The second task of biomedical literature classification is recognizing the relevant topics behind biomedical text, which can reduce the painstaking challenge of manually curating a huge amount of biomedical literature. This is especially important during the current COVID-19 pandemic, as a large number of clinical, epidemiological and laboratory studies have been conducted to provide policymakers with crucial insights into managing current and future medical and public health issues. This explosive growth of COVID-19 research has resulted in an increase of around 10 000 research articles per month investigating the disease, its causes, treatments and related issues, comprising more than 187 206 articles in PubMed (17). This kind of information overload is a burden on scholars and physicians and can easily hamper efforts to acquire the latest updates.

A solution to this problem is automated topic prediction, an emerging field in which NLP is used to handle COVID-19-related literature (18). Wahbeh et al. (19), for example, used topic modeling techniques to extract important unpublished clinical knowledge from physicians' social media posts. They uncovered eight subjects, with actions and recommendations being the most prevalent, followed by fighting misinformation. Li et al. (20) used text categorization as well as topic models to quantify temporal variations in the stress levels expressed in tweets by users in the USA and to identify the sources of stress. A substantial link between stress symptoms and an increase in COVID-19 cases in major US cities was discovered; moreover, the stress was found to originate in, and shift from, concerns about infection and other clinical issues to financial worries. In addition, in 2021, LitCovid (21), an open COVID-19 literature database, was developed. More than 100 000 articles have been added to the database, which has reached hundreds of institutions in academia, government and health organizations worldwide and serves millions of user accesses. The daily maintenance of LitCovid, which includes labeling each article with one or more of eight predetermined topics, such as Treatment and Diagnosis, can therefore be burdensome, posing a major challenge to the updating process.

Methodology

In this section, we introduce the proposed methods for Track 2 (NLM-CHEM) and Track 5 (LitCovid) of the BioCreative VII Challenge. The NLM-CHEM track aims to predict all chemicals mentioned in a full-text article and normalize them to a canonical form. We model it as a sequence labeling problem, defined as follows: given a sentence $S$ composed of a sequence of words $W = \{w_1, w_2, \ldots, w_k\}$, for each $w_r$ in $W$ there exists an $l_r$ in $L = \{l_1, l_2, \ldots, l_k\}$ such that each item in $W$ corresponds to its label in $L$. The purpose of our model is to predict the label of each word in the sequence and thereby identify the entities.

The LitCovid track involves automated topic annotation for the COVID-19 literature, which is a multi-label classification problem and can be formulated as follows. Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of documents and $T = \{t_1, t_2, \ldots, t_m\}$ a set of topics, where each topic can take one of two possible statuses $S = \{s_1, s_2\}$, with $s_1$ denoting relevant and $s_2$ irrelevant. For each document $d_l$ and each topic $t_j$, our target is to determine the most suitable status $s_i$. Note that a document can have more than one topic.

For both tracks, we propose an ensembled BERT-based approach, illustrated in Figure 1, that can predict topics and identify bio-entities in the biomedical literature. We first conduct linguistic preprocessing on the input corpus. We then adopt multiple pre-trained BERT models to identify bio-entities for Track 2 and to predict topic labels for Track 5, and we integrate the multiple outcomes through an ensemble learning approach to produce the final output. The following paragraphs describe the design of each layer.

Figure 1.

Illustration of the ensemble model in this work for the BioCreative VII challenges.

BERT model for the NLM-CHEM and LitCovid tracks

Input layer—preprocessing and text representation

Preprocessing is crucial for efficiently building machine learning models. This stage consists of converting all words to lower case and removing stop words as well as punctuation. We use the WordPiece toolkit (22) to represent words as sequences of smaller subword tokens, and positional embeddings (23) are also included. Next, the input text sequence is converted to the corresponding target format for each of the two tracks, as described below.
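
As a concrete illustration, the following is a minimal sketch of this preprocessing and tokenization step, assuming the HuggingFace transformers tokenizer; the checkpoint name, helper function and stop-word list are illustrative rather than the exact ones used in this work.

from transformers import AutoTokenizer

# Illustrative checkpoint; any of the BERT variants in Table 1 could be used.
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

STOP_WORDS = {"the", "a", "of", "and"}  # placeholder stop-word list

def preprocess(text: str) -> str:
    # Lower-case the text and drop stop words and surrounding punctuation.
    words = [w.strip(".,;:!?()[]") for w in text.lower().split()]
    return " ".join(w for w in words if w and w not in STOP_WORDS)

text = preprocess("Amiodarone is a cause of pulmonary toxicity.")
# WordPiece splits rare words into subword tokens; positional embeddings
# are added inside the BERT model itself.
encoding = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist()))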

The NLM-CHEM track facilitates the development of algorithms that can accurately predict chemical entities in the biomedical literature and determine which of these entities should be indexed; the track can therefore be considered an NER task. In this track, we need to predict all chemical entities mentioned in the NLM-Chem corpus, in addition to 50 full-text articles published in Spring 2021. In light of this, we convert the input text sequence into a labeling sequence. We adopt the BIO format as the tagging scheme: a word labeled 'B' (Begin) is the first word of a chemical entity, a word labeled 'I' (Inside) is a middle or last word of a chemical entity, and a word labeled 'O' (Outside) does not belong to any chemical entity.
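
For example, with an illustrative sentence (not drawn from the corpus), the multi-word mention 'sodium chloride' receives B and I tags while every other word is tagged O:

# Toy BIO tagging example; tokens and labels are illustrative only.
tokens = ["Patients", "received", "sodium", "chloride", "infusions", "."]
labels = ["O",        "O",        "B",      "I",        "O",         "O"]
# "sodium chloride" is a single chemical entity: its first word is tagged
# B (Begin), the following word I (Inside), and all other words O (Outside).
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")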

For the LitCovid track, on the other hand, each article can be assigned one or more labels from a set of seven topics (mechanism, transmission, diagnosis, treatment, prevention, case report or epidemic forecasting). Improving the accuracy of automated topic prediction for COVID-19-related material helps researchers worldwide overcome information overload. As article titles and abstracts are primarily used to annotate topics, we formulate the topic classification task as a sentence pair classification problem and concatenate the title and abstract of each article as the input text. Finally, the special token '[CLS]' is inserted at the beginning of the sequence, following the common practice for classification tasks with pre-trained BERT models.
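
A minimal sketch of this input construction, assuming the HuggingFace transformers tokenizer (the checkpoint name and texts are illustrative): passing the title and abstract as a pair yields a sequence of the form [CLS] title [SEP] abstract [SEP].

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-cased-v1.0")

title = "Remdesivir for the treatment of COVID-19"           # illustrative
abstract = "We conducted a randomized controlled trial ..."  # illustrative

# Encoding the two segments as a pair automatically inserts the special
# tokens: [CLS] title [SEP] abstract [SEP].
encoding = tokenizer(title, abstract, truncation=True, max_length=512)
print(tokenizer.decode(encoding["input_ids"]))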

Multi-head attention layer

The multi-head attention layer proposed by Vaswani et al. (23) is used in this model. Essentially, the attention layer learns to map every input vector to a weighted sum of all the vectors in the input. Let the matrices $Q$, $K$ and $V$ denote the query, key and value representations, respectively, each with dimension $d_a$. The attention score of an input can be obtained through Eq. (1). The common improvement of employing multiple heads in the attention layer is also utilized in this model. Multi-head attention combines information from a variety of representation subspaces (23); in other words, each attention head applies a separate focus over the whole input sentence. The pre-trained BERT has the following hyperparameters: 12 Transformer layers with hidden dimension H = 768 and 12 attention heads.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_a}}\right)V \qquad (1)$$
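
A minimal PyTorch sketch of Eq. (1), shown here as a single scaled dot-product attention head on synthetic tensors; in multi-head attention this operation is applied once per head on separately projected inputs.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Eq. (1): softmax(Q K^T / sqrt(d_a)) V
    d_a = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_a ** 0.5)
    return F.softmax(scores, dim=-1) @ V

# Synthetic inputs: batch of 2 sequences, 12 tokens, head dimension 64.
Q = torch.randn(2, 12, 64)
K = torch.randn(2, 12, 64)
V = torch.randn(2, 12, 64)
out = scaled_dot_product_attention(Q, K, V)  # shape: (2, 12, 64)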

Output layer—sequence labeling and multi-label classification

In addition to the pre-trained layers, the last layer of our model consists of a fully connected network with output dimensions of 3 and 7 for NLM-CHEM and LitCovid, respectively. For the NLM-CHEM track, in order to compute the spans for the final evaluation, we take the outputs of the models and rematch them with the original text of the 'context' sentence after output generation. First, the hidden vectors of the final BERT layer are fed to the output layer with a dropout ratio of 0.3. Then, the softmax function is applied to the output to obtain a probability distribution over the BIO labels. The NLM-Chem data contain many sub-token entities, which are sub-strings of a token rather than the whole string. For example, Gly104Cys has two sub-token entities, 'Gly' and 'Cys'. In the official evaluation, models are expected to predict the sub-token entities, not the whole tokens. The majority of sub-token entities occur within mutation names, and approximately 90% of them can be handled with simple regular expressions. Consequently, we perform post-processing on sub-token entities, which considerably enhances the performance in the official assessment.
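
As a hypothetical sketch of one such rule (the pattern and helper function are illustrative; the actual post-processing uses a larger set of expressions), a mutation token such as 'Gly104Cys' can be split into its two residue sub-tokens:

import re

# Matches amino-acid substitution mutations such as "Gly104Cys":
# a three-letter residue, a position number and a second residue.
MUTATION = re.compile(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")

def subtoken_spans(token: str, offset: int):
    """Return (start, end) character spans of residue sub-tokens in `token`."""
    match = MUTATION.match(token)
    if not match:
        return []
    position = match.group(2)
    return [
        (offset, offset + 3),                                      # e.g. "Gly"
        (offset + 3 + len(position), offset + 6 + len(position)),  # e.g. "Cys"
    ]

print(subtoken_spans("Gly104Cys", 0))  # [(0, 3), (6, 9)]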

For the LitCovid track, since we treat it as a multi-label classification problem, the output is a seven-dimensional vector: there are seven topics, and each topic label is a binary classification of relevant or irrelevant. We used BCEWithLogitsLoss (https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html) as the loss function for the multi-label setting. This loss function combines a sigmoid layer and the binary cross-entropy loss (BCELoss) in a single layer; as a result, it is more numerically stable than a plain sigmoid layer followed by BCELoss, because combining the operations into one layer takes advantage of the log-sum-exp trick (24). Applying a weighted BCEWithLogitsLoss (in this research, we utilized a widely used heuristic for setting class weights that is included in the scikit-learn package; see Appendix A for details) can alleviate the problem of data imbalance, and it has accordingly been popularized in recent research (25-27). More specifically, we employed this loss function to calculate the probability of each topic and took the average of the losses over all topics as the final loss during model training. The outputs are the topic labels, each with a class (relevant or irrelevant), yielding seven possible output states.
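
The following is a minimal sketch of this multi-label head and loss on synthetic tensors; the pos_weight values are placeholders, whereas in this work the class weights follow the scikit-learn heuristic described in Appendix A.

import torch
import torch.nn as nn

NUM_TOPICS = 7

# Synthetic batch: 16 documents, one logit per topic.
logits = torch.randn(16, NUM_TOPICS)
targets = torch.randint(0, 2, (16, NUM_TOPICS)).float()

# BCEWithLogitsLoss fuses the sigmoid and the binary cross-entropy,
# which is numerically more stable than applying them separately.
pos_weight = torch.ones(NUM_TOPICS)  # placeholder class weights
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = loss_fn(logits, targets)      # averaged over topics and batch

# At inference, each topic is thresholded independently.
predictions = (torch.sigmoid(logits) > 0.5).int()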

Applying an ensemble learning mechanism to boost model performance

Ensemble learning is accomplished by thresholding the average zero-one decisions of each model for every considered label (9). This technique mixes many individual models to improve generalization performance, and deep learning models with multilayer processing architectures currently outperform shallow or traditional classification models. This inspired us to combine the benefits of deep learning and ensemble learning to obtain a model with improved generalization performance (10), and we employed this ensemble learning approach for both tracks. We therefore built a general ensemble learning framework that fuses multiple classifiers created from different pre-trained language models. Since every feature representation is biased and volatile, any single model can be considered a weak classifier in ensemble learning theory (11, 12, 14). Therefore, we trained several different weak classifiers and then combined them for better results. We selected several state-of-the-art pre-trained models as initializations for the classifiers in view of their superior performance in the biomedical domain. By aggregating weak classifiers, the system can effectively minimize the bias and variance of the individual learners, resulting in a stronger learner with higher accuracy and more resilient performance.

To select the most suitable pre-trained models for our ensemble, we performed 10-fold cross-validation experiments with a variety of BERT models; in the end, we retained BioBERT, PubMedBERT, the BioM models and Bioformer for the ensemble. We take the mean of the predicted probabilities of the individual classifiers and use argmax to obtain the class label. To balance out the individual weaknesses of the five pre-trained BERT models, we performed further experiments that combined the Bioformer, BioBERT, BioM-D, BioM-S and PubMedBERT models into a single ensemble. We experimented with combining the models in pairs, as well as all five models together, using the ensemble method to combine their predictions. The final output of the ensemble was calculated by taking the mean of the predictions from the selected combination of models. Detailed descriptions are given in Table 1.
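
A minimal sketch of this probability-averaging step on synthetic outputs (the arrays stand in for the per-class probabilities produced by each fine-tuned model):

import numpy as np

# Per-class probabilities from three models for one token (synthetic values);
# columns correspond to the B, I and O labels in the NLM-CHEM setting.
probs_per_model = [
    np.array([[0.70, 0.20, 0.10]]),  # e.g. Bioformer
    np.array([[0.50, 0.40, 0.10]]),  # e.g. PubMedBERT
    np.array([[0.60, 0.30, 0.10]]),  # e.g. BioM-ELECTRA
]

mean_probs = np.mean(probs_per_model, axis=0)  # average over models
predicted_label = mean_probs.argmax(axis=-1)   # argmax -> class index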

Table 1.

Ensembled biomedical-based BERT models for the NLM-CHEM and LitCovid tracks

BioBERT (checkpoint: biobert-base-cased-v1.2). This is the first biomedical-specific BERT model, proposed by Lee et al. (28). It adopted BERT for its initialized weights, was further pre-trained on large-scale biomedical corpora (PubMed abstracts and PMC full-text articles), and performs well in a variety of biomedical text mining tasks. For the LitCovid track, we use BioBERT v1.2 (https://huggingface.co/dmis-lab/biobert-base-cased-v1.2), which follows the training process of BioBERT v1.1 but includes an LM head, which can be useful for probing.

PubMedBERT (checkpoint: PubMedBERT). Gu et al. (29) pre-trained this model from scratch on PubMed abstracts with a large batch size (8192), and it showed substantial gains over continual pre-training of general-domain BERT. PubMedBERT achieves state-of-the-art performance on several biomedical NLP tasks, as shown on the Biomedical Language Understanding and Reasoning Benchmark (BLURB) (13). In this research, we adopted PubMedBERT (https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) for both the NLM-CHEM and LitCovid tracks.

BioM-Transformers (checkpoints: BioM-ELECTRA-Large-Discriminator, BioM-ELECTRA-Large-SQuAD2). Alrowili and Shanker (17) pre-trained several large biomedical language models using the original implementations of BERT (30), ALBERT (31) and ELECTRA (18). For both tracks, we adopted two BioM-ELECTRA variants: BioM-ELECTRA-Large-Discriminator (https://huggingface.co/sultan/BioM-ELECTRA-Large-Discriminator), which was pre-trained on PubMed abstracts only, with a biomedical-domain vocabulary, for 434 K steps with a batch size of 4096, and BioM-ELECTRA-Large-SQuAD2 (https://huggingface.co/sultan/BioM-ELECTRA-Large-SQuAD2), which fine-tuned BioM-ELECTRA-Large on the SQuAD2.0 dataset.

Bioformer (checkpoints: bioformer-cased-v1.0, bioformer-cased-v1.0-bc2gm). Chen et al. (32) pre-trained Bioformer on all PubMed abstracts (as of January 2021) and 1 million randomly sampled PubMed Central full-text articles. This model achieved the best performance in the LitCovid track of the BioCreative VII Challenge. We adopted bioformer-cased-v1.0 (https://huggingface.co/bioformers/bioformer-cased-v1.0) for both tracks, as well as bioformer-cased-v1.0-bc2gm (https://huggingface.co/bioformers/bioformer-cased-v1.0-bc2gm), which was fine-tuned on the BC2GM (33) dataset and is suited to recognizing gene and protein entities.

MeSH ID Normalization Algorithm
INPUT: E = {e1, …, ec} (the set of all predicted chemical named entities); K = {k1:v1, …, km:vm} (the set of key-value pairs in the MeSH ID dictionary)
BEGIN
1:  FOR EACH predicted entity er IN E
2:    IF er exactly matches a key ki IN K
3:      OUTPUT vi AS the MeSH ID of er
4:    ELSE
5:      FOR i = 1 TO m
6:        compute simi = LevenshteinSimilarity(er, ki)
7:      END FOR
8:      IF max(sim1, …, simm) ≥ 90%
9:        OUTPUT vj AS the MeSH ID of er, where j = argmaxi simi
10:     ELSE
11:       OUTPUT '-' AS an empty value
12: END FOR EACH
END

An edit distance-based entity linking approach for chemical name normalization

In this research, we employ the edit distance algorithm to address the entity linking problem in chemical normalization. At the outset, a collection of MeSH IDs and their corresponding names from the dataset was compiled into a knowledge base (dictionary). During prediction, we look up each predicted chemical named entity in the dictionary to find the correct mapping. For entities missing from the dictionary, we calculate the Levenshtein distance (34) with a 90% similarity threshold to obtain the most similar term and its ID. Finally, if neither step yields a match, we assign the term a null value. In practice, we implement this with the thefuzz (https://github.com/seatgeek/thefuzz) and python-Levenshtein (https://github.com/ztane/python-Levenshtein/) Python packages. The chemical named entity normalization algorithm is presented in the MeSH ID Normalization Algorithm box above.

Mapping the set of predicted chemical named entities ($E$) against the set of key-value pairs in the MeSH ID dictionary ($K$) over $n$ passes requires on the order of $(|E| \cdot |K|) \times n$ comparison operations. We employed parallel threading to speed up the search, reducing the total search time to under an hour using 20 CPU cores. The resulting framework was shown to achieve remarkable performance, surpassing the baseline as well as the median of all compared methods.
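
A minimal sketch of this lookup using thefuzz and a thread pool, under the assumption of a toy dictionary (the entries and entity strings are illustrative):

from concurrent.futures import ThreadPoolExecutor
from thefuzz import fuzz, process

# Toy MeSH dictionary: chemical name -> MeSH ID (entries illustrative).
mesh_dict = {"acetaminophen": "D000082", "tamoxifen": "D013629"}

def normalize(entity: str) -> str:
    if entity in mesh_dict:                  # 1) exact dictionary match
        return mesh_dict[entity]
    best = process.extractOne(entity, mesh_dict.keys(), scorer=fuzz.ratio)
    if best is not None and best[1] >= 90:   # 2) >= 90% similarity fallback
        return mesh_dict[best[0]]
    return "-"                               # 3) no confident mapping

entities = ["tamoxifen", "acetominophen", "aspirin"]
with ThreadPoolExecutor(max_workers=20) as pool:  # threaded, as in the paper
    mesh_ids = list(pool.map(normalize, entities))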

Experiments

Dataset & setting

The NLM-CHEM track uses the NLM-Chem corpus (35) as the training and development sets. This corpus includes 150 full-text articles with about 5000 unique chemical named entities that are mapped to approximately 2000 MeSH identifiers. The test set is a collection of 50 recently published full-text articles on PubMed that were scheduled for manual indexing in 2021. More specifically, there are 3740 unique chemical strings and 1352 unique MeSH IDs in the test set. The average number of chemical annotations per article is 300.4 terms, with a minimum of 2 terms and a maximum of 1318 terms. The distribution of the number of unique MeSH IDs per article is similar, with a minimum of 1 and an average of 41, while the largest number of unique MeSH IDs in an article is 127.

The LitCovid track employs the LitCovid corpus (21) for multi-label topic classification of the COVID-19 literature. The training and development sets contain more than 30 000 COVID-19-related articles, and the evaluation dataset includes 2500 manually reviewed articles. Each article is provided with its title and abstract along with other meta-information, such as the DOI, journal name and keywords, and may carry one or more labels. The labels are Treatment, Diagnosis, Prevention, Mechanism, Transmission, Epidemic Forecasting and Case Report. Detailed information on the corpora used in both tracks is listed in Table 2.

Table 2.

The data distribution of the NLM-CHEM and LitCovid tracks in the BioCreative VII Challenge

NLM-CHEM corpus                       Training           Development        Test
# of articles                         100                50                 54
# of chemical NEs (with a MeSH ID)    26 567 (26 339)    11 772 (11 660)    22 942 (22 777)

LitCovid corpus                       Training           Development        Test
# of articles                         24 960             6239               2500
# of Prevention                       11 102 (44.48%)    2750 (44.08%)      1035 (41.4%)
# of Treatment                        8717 (34.2%)       2207 (35.37%)      722 (28.88%)
# of Diagnosis                        6193 (24.81%)      1546 (24.78%)      926 (37.04%)
# of Mechanism                        4438 (17.78%)      1073 (17.2%)       567 (22.68%)
# of Case report                      2063 (8.27%)       482 (7.72%)        128 (5.12%)
# of Transmission                     1088 (4.35%)       256 (4.1%)         41 (1.64%)
# of Epidemic forecasting             645 (2.58%)        192 (3.08%)        197 (7.88%)

The metrics used to evaluate prediction performance in the NLM-CHEM track are precision, recall and F1-score (under 'strict' and 'approximate' evaluation settings), with the micro-average used to compare overall performance. Specifically, for both the NER and normalization tasks, the 'strict' setting expects an exact match between the predicted span of an entity/MeSH ID and the gold annotated span/ID, whereas the 'approximate' setting for the NER task considers a predicted span correct if it overlaps with the gold span.

As for the LitCovid track, the two most widely utilized kinds of metrics for multi-label categorization are label-based and instance-based measures (36). Label-based evaluation judges each label independently, calculating each label's performance before aggregating the results over all labels; instance-based measures, on the other hand, treat every instance as a separate entity. Similar to the NLM-CHEM track, this track evaluates the precision, recall and F1-score of the instance-based results, and the macro- and micro-averages are further adopted to estimate label-based performance.
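
As an illustration of the difference between these averages, the following sketch computes label-based micro- and macro-averages and instance-based ('samples') scores with scikit-learn on synthetic multi-label predictions:

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Synthetic multi-label ground truth and predictions (3 documents, 3 topics).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])

# Label-based: aggregate per-label counts (micro) or average per-label
# scores (macro); instance-based: score each document, then average.
for avg in ("micro", "macro", "samples"):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
    print(f"{avg}: P={p:.3f} R={r:.3f} F1={f1:.3f}")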

The proposed model was implemented using PyTorch (https://pytorch.org/), a Python deep learning library. We adopted common optimizer and hyper-parameter settings for fine-tuning, i.e. 10 training epochs with the AdamW optimizer (37) and a learning rate of 2e-5; the weight decay was set to 1e-3 to improve training stability. Batch sizes of 16 and 64 were used for the NLM-CHEM and LitCovid tracks, respectively. The maximum sequence length was 512 tokens, with padding or truncation at the end of the sequence. We ran the proposed model on two NVIDIA GeForce RTX 3090 GPUs.
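
A configuration sketch that reflects the reported hyper-parameters (the checkpoint name is illustrative, and the surrounding training loop is omitted):

from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

# LitCovid-style multi-label head; NLM-CHEM instead uses a token
# classification head with 3 labels (B/I/O).
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract",
    num_labels=7,
    problem_type="multi_label_classification",
)

# Reported settings: 10 epochs, lr 2e-5, weight decay 1e-3,
# batch size 16 (NLM-CHEM) or 64 (LitCovid), max length 512.
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=1e-3)
EPOCHS, MAX_LEN = 10, 512
BATCH_SIZE = {"nlm_chem": 16, "litcovid": 64}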

Results and discussion

To conduct a comprehensive evaluation, we list as comparisons the benchmarks (BlueBERT (38) for NLM-CHEM; ML-Net (39) for LitCovid), the median performance of the participating teams (MPT) and the top-one system (T1S (40) for NLM-CHEM; Bioformer (41) for LitCovid) from both tracks. Moreover, we also selected the collection of BERT variants used in our ensemble learning approach: BioBERT, PubMedBERT, BioM-ELECTRA-Large-Discriminator (BioM-D), BioM-ELECTRA-Large-SQuAD2 (BioM-S) and bioformer-cased-v1.0 (Bioformer) for both tracks, and bioformer-cased-v1.0-bc2gm (Bioformer-B) for the NLM-CHEM track.

Table 3 presents the performance of Bioformer and the results of incrementally adding different pre-trained BERT models in the NLM-CHEM track. The performance can be further improved by integrating different BERT models incrementally under the ensemble learning framework; consequently, applying them all together achieves the best performance. In addition, we investigated the impact of different data sizes, as shown in Table 4. In general, our system's performance is not significantly affected by data size; the impact is substantial only when just 10% of the data is used, in which case performance drops by more than 10 percentage points in F1-score. The results show that the proposed method is robust and efficient in both tracks.

Table 3.

Incremental contribution of different BERT models for ensemble learning in the NLM-CHEM track

Cells show Precision/Recall/F1-score for chemical mention recognition and chemical normalization to MeSH IDs, each under the strict and approximate evaluations.

System          Mention (strict)        Mention (approximate)   MeSH ID (strict)        MeSH ID (approximate)
Bioformer       0.8156/0.8576/0.8361    0.8846/0.9236/0.9037    0.7570/0.8294/0.7915    0.7130/0.8596/0.7756
+Bioformer-B    0.8140/0.8558/0.8344    0.8847/0.9249/0.9044    0.7652/0.8306/0.7965    0.7162/0.8635/0.7796
+BioM-D         0.8299/0.8419/0.8469    0.9216/0.9052/0.9133    0.7707/0.8312/0.7988    0.7271/0.8616/0.7856
+BioM-S         0.8294/0.8627/0.8457    0.8969/0.9276/0.9120    0.7697/0.8303/0.7988    0.7247/0.8588/0.7826
+PubMedBERT     0.8535/0.8622/0.8578    0.9201/0.9237/0.9219    0.7835/0.8303/0.8062    0.7448/0.8570/0.7933

Table 4.

The impact of different data sizes in the NLM-CHEM track

Cells show Precision/Recall/F1-score for chemical mention recognition and chemical normalization to MeSH IDs, each under the strict and approximate evaluations.

Data size   Mention (strict)        Mention (approximate)   MeSH ID (strict)        MeSH ID (approximate)
10%         0.7029/0.7247/0.7136    0.8296/0.8239/0.8268    0.6726/0.7414/0.7053    0.6425/0.7957/0.7061
20%         0.7679/0.8355/0.8003    0.8498/0.9138/0.8806    0.7276/0.8045/0.7641    0.6782/0.8438/0.7487
50%         0.8018/0.8680/0.8336    0.8776/0.9420/0.9087    0.7494/0.8257/0.7857    0.7060/0.8583/0.7704
100%        0.8535/0.8622/0.8578    0.9201/0.9237/0.9219    0.7835/0.8303/0.8062    0.7448/0.8570/0.7933

Table 5 presents the performance comparisons in the NLM-CHEM track. The overall outcome of BioBERT is F1-scores of about 80% and 86% in the strict and approximate evaluations, respectively, which is generally worse than all of the compared methods. This is most likely because it was the first biomedical-specific BERT, and the scale of its training dataset is smaller than those of the other compared systems. In contrast, BlueBERT was pre-trained on the BLUE (Biomedical Language Understanding Evaluation) dataset (38), a much more complex corpus consisting of five tasks with ten datasets that cover both biomedical and clinical articles of various sizes and levels of difficulty; hence, it surpassed BioBERT by about 2% in terms of F1-score. Furthermore, PubMedBERT, BioM and Bioformer employed more pre-training data and therefore achieved better fine-tuned performance, with F1-scores of 83% and 90% in the strict and approximate evaluations, respectively. They significantly outperformed BlueBERT and were even superior to the median performance of the participating teams in this track.

It is noteworthy that the ensemble learning-based method T1S and our proposed method further enhance the overall performance by 3%, achieving F1-scores of 86% and 92% in the strict and approximate evaluations, respectively. This indicates that integrating multiple BERT models can significantly advance the performance of full-text chemical identification. Interestingly, T1S achieved the best precision because its tagging consistency and entity coverage are improved through majority voting: the ensemble method of T1S focused on inconsistent predictions within the same article, computed the majority over the model predictions and changed all minority predictions to the majority label. In this way, its ensemble mechanism amounted to majority voting over all predictions from the individual models within an article. Our method, in contrast, achieved the best recall. We postulate that because our ensemble approach integrates multiple outputs from different BERT models, it obtains a better generalization over the textual structures of chemical entities; this facilitates learning the characteristics of chemical identification for each structural type, which in turn increases the recall rate.

In addition, our proposed MeSH ID normalization algorithm is effective for chemical entity normalization, achieving an F1-score of about 80% in both the strict and approximate evaluations. We observe that the strict and approximate scores do not differ much, possibly because of the short token lengths, which allow the edit distance-based method to efficiently match partial token sequences against the correct answer in the MeSH hierarchy.

Table 5.

The performance results of the methods in the NLM-CHEM track

Cells show Precision/Recall/F1-score for chemical mention recognition and chemical normalization to MeSH IDs, each under the strict and approximate evaluations.

System        Mention (strict)        Mention (approximate)   MeSH ID (strict)        MeSH ID (approximate)
BlueBERT      0.8440/0.7877/0.8149    0.9156/0.8492/0.8811    0.8151/0.7644/0.7899    0.7917/0.7889/0.7857
MPT           0.8476/0.8136/0.8373    0.9220/0.8682/0.8951    0.7120/0.7760/0.7749    0.6782/0.8402/0.7552
BioBERT       0.8010/0.7830/0.7919    0.8773/0.8528/0.8649    0.7582/0.8205/0.7881    0.7096/0.8497/0.7690
PubMedBERT    0.8488/0.8542/0.8515    0.9184/0.9171/0.9177    0.7788/0.8272/0.8023    0.7354/0.8586/0.7889
BioM-S        0.8583/0.8457/0.8520    0.9246/0.9055/0.9149    0.7816/0.8290/0.8046    0.7374/0.8613/0.7898
BioM-D        0.8520/0.8419/0.8469    0.9216/0.9052/0.9133    0.7840/0.8275/0.8052    0.7432/0.8566/0.7923
Bioformer     0.8156/0.8576/0.8361    0.8846/0.9236/0.9037    0.7570/0.8294/0.7915    0.7130/0.8596/0.7756
Bioformer-B   0.8140/0.8558/0.8344    0.8847/0.9249/0.9044    0.7652/0.8306/0.7965    0.7166/0.8626/0.7793
T1S           0.8759/0.8587/0.8672    0.9373/0.9161/0.9266    0.8621/0.7702/0.8136    0.8302/0.7867/0.8030
Our method    0.8535/0.8622/0.8578    0.9201/0.9237/0.9219    0.7835/0.8303/0.8062    0.7448/0.8570/0.7933

For the performance evaluation of the LitCovid track, Table 6 shows the incremental performance of utilizing different pre-trained BERT models, and Table 7 presents the impact of different data sizes on performance. The trends mirror those of the NLM-CHEM track: integrating the effective models all together achieves the best performance, and the proposed BERT-based ensemble approach is not only efficient but also robust, achieving remarkable performance across different dataset sizes.

Table 8 displays the performances of the compared systems on multi-label topic classification in the LitCovid track. The baseline method, ML-Net, is a BiLSTM-based neural network; it had a mediocre performance, with F1-scores of 76.6% and 86.8% on the label-based macro-average and instance-based evaluations, respectively. The BERT-based models significantly improve performance, by about 10% in F1-score, in both the label-based and instance-based evaluations. Interestingly, BioBERT outperformed almost all of the compared systems and is comparable to Bioformer, which differs from its performance in the NLM-CHEM track. Bioformer was pre-trained on three different sources: PubMed abstracts, full text from one million PMC articles and approximately 20 000 abstracts of COVID-19 publications. It thus achieved the best performance among the participating teams in the LitCovid track. In this paper, we used Bioformer together with the other BERT models and integrated their advantages through the ensemble mechanism. For this reason, the proposed method outperformed all of the compared systems and achieved state-of-the-art performance on the LitCovid corpus.

Table 6.

Incremental contribution of different BERT models for ensemble learning in the LitCovid track

Cells show Precision/Recall/F1-score.

System         Label-based micro-avg.   Label-based macro-avg.   Instance-based
Bioformer      0.9367/0.9002/0.9181     0.9038/0.8823/0.8875     0.9414/0.9256/0.9334
+BioBERT       0.9170/0.9165/0.9167     0.8815/0.8902/0.8818     0.9355/0.9367/0.9361
+BioM-S        0.9240/0.9140/0.9189     0.9001/0.8759/0.8858     0.9403/0.9357/0.9380
+PubMedBERT    0.9303/0.9076/0.9188     0.9128/0.8681/0.8865     0.9454/0.9321/0.9387
+BioM-D        0.9342/0.9062/0.9200     0.9155/0.8695/0.8881     0.9475/0.9311/0.9392

Table 7.

The impact of different data sizes in the LitCovid track

(Each cell reports Precision/Recall/F1-score.)

Data size | Label-based micro-avg. | Label-based macro-avg. | Instance-based
10% | 0.9169/0.8908/0.9036 | 0.9087/0.8230/0.8537 | 0.9308/0.9179/0.9243
20% | 0.9242/0.9002/0.9120 | 0.9089/0.8455/0.8690 | 0.9387/0.9255/0.9321
50% | 0.9250/0.9137/0.9193 | 0.9114/0.8642/0.8826 | 0.9419/0.9354/0.9386
100% | 0.9342/0.9062/0.9200 | 0.9155/0.8695/0.8881 | 0.9475/0.9311/0.9392
Table 8.

The performance results of the methods in the LitCovid track

(Each cell reports Precision/Recall/F1-score.)

Systems | Label-based micro-avg. | Label-based macro-avg. | Instance-based
ML-Net | 0.8756/0.8142/0.8437 | 0.8364/0.7309/0.7655 | 0.8849/0.8514/0.8678
MPT | 0.8967/0.8624/0.8778 | 0.8670/0.8012/0.8191 | 0.8985/0.8887/0.8931
BioBERT | 0.9343/0.9010/0.9174 | 0.9214/0.8417/0.8725 | 0.9440/0.9254/0.9346
PubMedBERT | 0.9243/0.8946/0.9092 | 0.8933/0.8681/0.8740 | 0.9363/0.9214/0.9288
BioM-S | 0.9214/0.8985/0.9098 | 0.9123/0.8590/0.8822 | 0.9359/0.9240/0.9299
BioM-D | 0.9288/0.8838/0.9058 | 0.8975/0.8461/0.8648 | 0.9427/0.9140/0.9281
Bioformer | 0.9367/0.9002/0.9181 | 0.9038/0.8823/0.8875 | 0.9414/0.9256/0.9334
Our method | 0.9342/0.9062/0.9200 | 0.9155/0.8695/0.8881 | 0.9475/0.9311/0.9392

Table 9 presents the classification errors of each topic type, with the false positive rate (FPR) and false negative rate (FNR) of the proposed method. A relatively high FPR occurred for 'Treatment': more than 40% of the data is related to 'Treatment', which biases the model towards this majority class. Imbalanced data also affects small classes, such as 'Epidemic Forecasting' and 'Transmission'. Further analysis showed that all positive instances of 'Epidemic Forecasting' co-occur only with negative instances of 'Treatment'. The co-occurrence of 'Transmission' and 'Treatment', however, is mixed, which makes the model more susceptible to the imbalanced-data problem; consequently, a large portion of the FNR occurred for 'Transmission'. Our error analysis shows that the improvement in multi-label classification remains limited, even though we adopted BCEWithLogitsLoss as the loss function to alleviate data imbalance. Designing a loss function that more effectively reduces the impact of imbalanced data will be the foremost issue addressed in our future work.
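
As a reference for how the figures in Table 9 can be derived, the sketch below counts per-label false positives and false negatives and computes FPR and FNR from binary ground-truth and prediction matrices; the helper name is our own, and the exact evaluation script may differ.

```python
import numpy as np

def per_label_error_rates(y_true, y_pred):
    """Per-label FP/FN counts and rates for multi-label predictions.

    y_true, y_pred: binary arrays of shape (n_samples, n_labels).
    FPR = FP / (FP + TN); FNR = FN / (FN + TP).
    """
    fp = ((y_pred == 1) & (y_true == 0)).sum(axis=0)
    fn = ((y_pred == 0) & (y_true == 1)).sum(axis=0)
    tp = ((y_pred == 1) & (y_true == 1)).sum(axis=0)
    tn = ((y_pred == 0) & (y_true == 0)).sum(axis=0)
    fpr = fp / np.maximum(fp + tn, 1)  # guard against labels with no negatives
    fnr = fn / np.maximum(fn + tp, 1)  # guard against labels with no positives
    return fp, fn, fpr, fnr
```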

Table 9.

Error distribution of the LitCovid track

Label (support) | #FP | #FN | FPR | FNR
Treatment (1035) | 57 | 100 | 6.82% | 5.50%
Diagnosis (722) | 41 | 94 | 2.30% | 13.01%
Prevention (926) | 45 | 63 | 2.85% | 6.80%
Mechanism (567) | 20 | 61 | 1.03% | 9.82%
Transmission (128) | 9 | 48 | 0.37% | 37.50%
Epidemic forecasting (41) | 10 | 5 | 0.40% | 12.19%
Case report (197) | 6 | 11 | 0.26% | 5.58%

The COVID-19 pandemic has had a wide-ranging influence on society, causing increased death and morbidity as well as interruptions to daily life and overall unease. Many of these issues are unique in type, scope or cause, and one of the most effective ways to address them is better information, that is, the right amount of precise data at the point where it can be acted upon (42). However, the difficulty of locating credible and practical knowledge specific to a given context triggered a second epidemic: information overload, compounded by the evolving understanding of the disease and a wave of article retractions from even the most prestigious publications. Meanwhile, members of the public were subjected to severe psychological stress as a result of shifting public health policies, severe economic consequences and health uncertainties, all while dealing with their own information overload via news and social media, which was exacerbated by inconsistent messaging and deliberate misinformation campaigns. Nevertheless, many existing NLP tasks can directly address information requirements during the COVID-19 epidemic, and our proposed method showed promising results simply by improving on these existing tasks.

In addition, the establishment of COVID Moonshot, and its collaboration with PostEra, a startup focusing on machine-learning-powered medicinal chemistry, to deliver an antiviral drug for COVID-19, showcased the potential for drug discovery to be accelerated with the assistance of machine learning. This benefits the world at large, as more breakthroughs may be achieved for more diseases in a shorter time, bringing possible cures to more people.

Concluding remarks

BioNLP is gaining importance due to the huge yearly increase in published biomedical literature, which makes manual curation very challenging. In this research, we introduced a BERT-based ensemble learning approach for the NLM-CHEM and LitCovid tracks in the BioCreative VII Challenge and explored various state-of-the-art biomedical-specific pre-trained BERT models in both tracks. As the different BERT models have their own characteristics, they also have distinct advantages that enable them to perform well; combining them through ensemble learning therefore improves the system's performance. For the NLM-CHEM track, our model achieved remarkable performance in chemical identification. We further proposed a MeSH ID normalization algorithm for the normalization of chemical entities, and the experimental results demonstrated that this dynamic programming-based method is effective. For the LitCovid track, our BERT-based ensemble approach achieved state-of-the-art performance in detecting topics in the COVID-19 literature. This study also explored the performance of various BERT-based models on the NLM-CHEM and LitCovid tasks and demonstrated that integrating BERT models through ensemble learning can further improve system performance. These results can inform future research addressing both tasks.
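
To make the dynamic programming idea concrete, the sketch below implements the classic Levenshtein edit distance (34) and uses it to map a mention to the MeSH ID of its closest lexicon entry; the toy lexicon and function names are illustrative, and the actual normalization algorithm proposed in this paper involves more than a nearest-string lookup.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic programming recurrence (34)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalize_mention(mention: str, lexicon: dict) -> str:
    """Pick the MeSH ID whose lexicon term has the smallest edit distance.

    lexicon: {term: mesh_id}, a toy stand-in for the real MeSH vocabulary.
    """
    best_term = min(lexicon, key=lambda t: levenshtein(mention.lower(), t.lower()))
    return lexicon[best_term]

print(normalize_mention("acetaminophene",
                        {"acetaminophen": "D000082", "aspirin": "D001241"}))
# D000082
```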

In the future, we will integrate deeper semantic information into the BERT architecture by exploring other aspects, such as dependency structures in texts. We will also use relation extraction algorithms to recognize chemical-relation passages and construct a relation network of chemicals.

Acknowledgements

This research was supported by the Ministry of Science and Technology of Taiwan under grants MOST 110-2634-F-038-006, MOST 110-2634-F-A49-004, and MOST 109-2410-H-038-012-MY2.

Conflict of interest

None declared.

References

1. Zhang, D., Mishra, S., Brynjolfsson, E. et al. (2021) The AI Index 2021 annual report. arXiv preprint arXiv:2103.06312.
2. Hu, X. and Liu, H. (2012) Text analytics in social media. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA, pp. 385–414.
3. Tan, A.-H. (1999) Text mining: the state of the art and the challenges. In: Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases. Citeseer, Beijing, Vol. 8, pp. 65–70.
4. Manning, C. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.
5. Torfi, A., Shirvani, R.A., Keneshloo, Y. et al. (2020) Natural language processing advancements by deep learning: a survey. arXiv preprint arXiv:2003.01200.
6. Naseem, U., Razzak, I., Khan, S.K. et al. (2021) A comprehensive survey on word representation models: from classical to state-of-the-art word representation language models. Transactions on Asian and Low-Resource Language Information Processing, 20, 1–35.
7. Fiorini, N., Lipman, D.J. and Lu, Z. (2017) Cutting edge: towards PubMed 2.0. eLife, 6, e28801.
8. Cariello, M.C., Lenci, A. and Mitkov, R. (2021) A comparison between named entity recognition models in the biomedical domain. In: Proceedings of the Translation and Interpreting Technology Online Conference. INCOMA Ltd., Held Online, pp. 76–84.
9. Corbett, P. and Boyle, J. (2018) Chemlistem: chemical named entity recognition using recurrent neural networks. J. Cheminform., 10, 1–9.
10. Hong, S. and Lee, J.-G. (2020) DTranNER: biomedical named entity recognition with deep learning-based label-label transition model. BMC Bioinform., 21, 1–11.
11. Chang, Y.-C., Chu, C.-H., Su, Y.-C. et al. (2016) PIPE: a protein–protein interaction passage extraction module for BioCreative challenge. Database, 2016, baw101.
12. Gu, J., Sun, F., Qian, L. et al. (2019) Chemical-induced disease relation extraction via attention-based distant supervision. BMC Bioinform., 20, 1–14.
13. Wei, C.-H., Peng, Y., Leaman, R. et al. (2016) Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database, 2016, baw032.
14. Li, J., Sun, Y., Johnson, R.J. et al. (2016) BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016, baw068.
15. Zhou, H., Deng, H., Chen, L. et al. (2016) Exploiting syntactic and semantics information for chemical–disease relation extraction. Database, 2016, baw048.
16. Gu, J., Sun, F., Qian, L. et al. (2017) Chemical-induced disease relation extraction via convolutional neural network. Database, 2017, bax024.
17. Alrowili, S. and Vijay-Shanker, K. (2021) BioM-transformers: building large biomedical language models with BERT, ALBERT and ELECTRA. In: Proceedings of the 20th Workshop on Biomedical Language Processing. Association for Computational Linguistics, Online, pp. 221–227.
18. Clark, K., Luong, M.-T., Le, Q.V. et al. (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
19. Wahbeh, A., Nasralah, T., Al-Ramahi, M. et al. (2020) Mining physicians' opinions on social media to obtain insights into COVID-19: mixed methods analysis. JMIR Public Health Surveillance, 6, e19276.
20. Li, D., Chaudhary, H. and Zhang, Z. (2020) Modeling spatiotemporal pattern of depressive symptoms caused by COVID-19 using social media data mining. Int. J. Environ. Res. Public Health, 17, 4988.
21. Chen, Q., Allot, A. and Lu, Z. (2021) LitCovid: an open database of COVID-19 literature. Nucleic Acids Res., 49, D1534–D1540.
22. Wu, Y., Schuster, M., Chen, Z. et al. (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
23. Vaswani, A., Shazeer, N., Parmar, N. et al. (2017) Attention is all you need. Adv. Neural Inf. Process Syst., 30, 6000–6010.
24. Nielsen, F. and Sun, K. (2016) Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities. CoRR, 18, 442.
25. Hande, A., Puranik, K., Priyadharshini, R. et al. (2021) Evaluating pretrained transformer-based models for COVID-19 fake news detection. In: Proceedings of the 5th International Conference on Computing Methodologies and Communication (ICCMC). IEEE, pp. 766–772.
26. Lewis, A., Mahmoodi, E., Zhou, Y. et al. (2021) Improving tuberculosis (TB) prediction using synthetically generated computed tomography (CT) images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 3265–3273.
27. Melekhov, I., Tiulpin, A., Sattler, T. et al. (2019) DGC-Net: dense geometric correspondence network. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp. 1034–1042.
28. Lee, J., Yoon, W., Kim, S. et al. (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36, 1234–1240.
29. Gu, Y., Tinn, R., Cheng, H. et al. (2021) Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthcare (HEALTH), 3, 1–23.
30. Devlin, J., Chang, M.-W., Lee, K. et al. (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019. Association for Computational Linguistics, Minneapolis, Minnesota, pp. 4171–4186.
31. Lan, Z., Chen, M., Goodman, S. et al. (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
32. Chen, Q., Allot, A., Leaman, R. et al. (2021) Overview of the BioCreative VII LitCovid track: multi-label topic classification for COVID-19 literature annotation. In: Proceedings of the Seventh BioCreative Challenge Evaluation Workshop. arXiv preprint arXiv:2204.09781.
33. Smith, L., Tanabe, L.K., Kuo, C.-J. et al. (2008) Overview of BioCreative II gene mention recognition. Genome Biol., 9, 1–19.
34. Levenshtein, V.I. (1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, 707–710.
35. Islamaj, R., Leaman, R., Cissel, D. et al. The chemical corpus of the NLM-Chem BioCreative VII track.
36. Zhang, M.-L. and Zhou, Z.-H. (2013) A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng., 26, 1819–1837.
37. Loshchilov, I. and Hutter, F. (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
38. Peng, Y., Yan, S. and Lu, Z. (2019) Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474.
39. Du, J., Chen, Q., Peng, Y. et al. (2019) ML-Net: multi-label classification of biomedical texts with deep neural networks. J. Am. Med. Inform. Assoc., 26, 1279–1285.
40. Kim, H., Sung, M., Yoon, W. et al. (2021) Improving tagging consistency and entity coverage for chemical identification in full-text articles. arXiv preprint arXiv:2111.10584.
41. Fang, L. and Wang, K. Team Bioformer at BioCreative VII LitCovid track: multi-label topic classification for COVID-19 literature with a compact BERT model.
42. Chen, Q., Leaman, R., Allot, A. et al. (2021) Artificial intelligence in action: addressing the COVID-19 pandemic with natural language processing. Annu. Rev. Biomed. Data Sci., 4, 313–339.
43. King, G. and Zeng, L. (2001) Logistic regression in rare events data. Political Anal., 9, 137–163.

Appendix A

In this research, we utilized a widely used heuristic approach for setting class weights. It is inspired by King and Zeng (43) and is included in the scikit-learn package. During training, we gave more weight to the minority classes in the loss function, enabling the algorithm to focus on reducing their errors. In practice, we assigned class weights that are inversely proportional to the class frequencies using the following equation:
$$W_i = \frac{N}{n\_classes \times n_i}$$
where Wi is the weight of class i, N is the total number of data instances in the dataset, n_classes is the number of unique classes in the label set and ni is the number of data instances of class i. The calculated class weights are then utilized by BCEWithLogitsLoss for learning the LitCovid prediction. Moreover, we conducted an experiment to examine the impact when the weighting scheme is not applied, as shown in the following tables. The results demonstrated that the recall of the BERT-based models can be further improved by using the weighting scheme. Consequently, integrating the models allows the system to achieve the best performance.
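
A minimal PyTorch sketch of this weighting scheme follows; since the text does not specify which argument of BCEWithLogitsLoss receives the weights, passing them through pos_weight is our assumption, and the label counts are illustrative values based on the supports reported in Table 9.

```python
import torch
import torch.nn as nn

def inverse_frequency_weights(label_counts, n_total):
    """W_i = N / (n_classes * n_i), as in the equation above."""
    counts = torch.tensor(label_counts, dtype=torch.float)
    return n_total / (len(label_counts) * counts)

# Illustrative label counts based on the supports in Table 9
counts = [1035, 722, 926, 567, 128, 41, 197]
weights = inverse_frequency_weights(counts, n_total=sum(counts))

# Feeding the weights through pos_weight is one possible choice (our assumption)
criterion = nn.BCEWithLogitsLoss(pos_weight=weights)
logits = torch.randn(4, len(counts))                     # model outputs for 4 documents
targets = torch.randint(0, 2, (4, len(counts))).float()  # multi-hot topic labels
loss = criterion(logits, targets)
```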


This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com