Abstract

We present a system for automatically identifying a multitude of biomedical entities from the literature. This work is based on our previous efforts in the BioCreative VI: Interactive Bio-ID Assignment shared task, in which our system demonstrated state-of-the-art performance, achieving the highest results in named entity recognition. In this paper we describe the original conditional random field-based system used in the shared task as well as experiments conducted since, including better hyperparameter tuning and character level modeling, which led to further performance improvements. For normalizing the mentions into unique identifiers we use fuzzy character n-gram matching. The normalization approach has also been improved with a better abbreviation resolution method and stricter guideline compliance, resulting in vastly improved performance for various entity types. All tools and models used for both named entity recognition and normalization are publicly available under an open license.

Introduction

Named entity recognition and normalization are fundamental tasks in biomedical natural language processing (BioNLP), and finding solutions for them has been the main focus of various shared tasks organized within the BioNLP community (1, 2). The BioCreative VI: Interactive Bio-ID Assignment (Bio-ID) track is one of the most recent shared efforts in developing these tools, with the goal of automatically annotating text with the entity types and identifiers (IDs) for mentions such as genes and organisms, in order to facilitate the curation process. The task principally consists of two major subtasks: (i) named entity recognition (NER) and (ii) named entity normalization (NEN) (3).

On one hand, several machine learning-based approaches, such as support vector machines and neural networks, have been applied to NER tasks for entities ranging from genes to diseases, chemicals and anatomical parts (4, 5). The most recent successful approaches include conditional random field (CRF) classifiers and neural networks (5–7). The approaches for NEN, on the other hand, are largely based on string edit distance and term frequency-inverse document frequency (TF-IDF) weighted vector space representations, with a variety of preprocessing approaches to remove written variation (8, 9). Some neural approaches have also been suggested for the normalization task (10, 11) and, furthermore, strong results have been achieved by modeling the NER and NEN tasks jointly (12).

The methods seen in the BioCreative shared task follow the same general trends: Sheng et al. (13) rely on a neural NER model with stacked recurrent layers and a CRF output layer. Their internal experiments showed promising results for this approach in comparison to traditional CRFs, but the model did not achieve state-of-the-art results in the official evaluation. However, their normalization system, which utilizes external application programming interfaces (APIs), resulted in excellent performance. Another strong normalization approach is suggested by Dai et al. (14). Their system benefits from the full context of the target document instead of relying only on the caption text and attempts to find the most frequently mentioned identifiers in the case of ambiguous terms. They also suggest a convolutional neural model for the task but were not able to produce results comparable to their other approach. Moreover, Dai et al. only focus on organism entities, a major disadvantage of their system.

Figure 1

All processing steps included in the whole pipeline.

Our system, capable of recognizing six types of entities and assigning the corresponding identifiers, is based on CRF classifiers, fuzzy matching and a rule-based system (15). Our model achieved state-of-the-art results in the BioCreative shared task, being the best performing system in the NER subtask as well as in the normalization subtask for various entity types. Since the shared task we have improved the system in both of these subtasks even further. For the NER task we demonstrate increased performance with an entity type-specific hyperparameter optimization protocol and character level modeling. For normalization, we improve the system by expanding the abbreviation resolution context to the full-text documents and by increasing the coverage of the used ontologies. Both systems are publicly available at https://github.com/TurkuNLP/BioCreativeVI_BioID_assignment.

Methods

Our system utilizes a pipeline of independent processing steps for solving the tasks of NER and NEN. The processing modules and the data used are described in this section. The whole processing pipeline, including preprocessing, named entity recognition and normalization, is illustrated in Figure 1.

Data

The Bio-ID data set consists of annotated figure panel captions extracted from 570 and 196 full-text documents as training and test data, respectively (3). The annotations include nine entity types: Protein/Gene (Protein), Small molecules (Molecule), Cellular compartment (Cellular), Cell types and cell lines (Cell), Tissues and Organs (Tissue), Organisms and Species (Organism), microRNA (miRNA), BioAssay (Assay) and Protein Complexes (Complex). The majority of annotated entities belong to the Protein class, contributing ∼55% of the whole training set, while Complex is the least-annotated entity type.

In these experiments we ignore entity types Assay, miRNA and Protein Complex as they were not evaluated in the BioCreative shared task. Hence, the total entity counts are 58 321, 7476, 6312, 11 213, 10 604 and 7888 for Protein, Cellular, Tissue, Molecule, Cell and Organism, respectively. We randomly partition the provided training data into a training and a development set, containing 455 and 115 documents, respectively. The development set is utilized in hyperparameter selection.

Preprocessing

We preprocess the documents by using a publicly available tool for converting the character encodings to American Standard Code for Information Interchange (ASCII) (16). Characters with missing mappings, such as smiley faces and calendar symbols, are replaced with ‘-’ (dash). Subsequently, we split the documents into sentences and further tokenize and part-of-speech (POS) tag them using the GENIA sentence splitter (17), NERsuite tokenizer and NERsuite POS tagger modules (18), respectively.
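The following minimal sketch illustrates this character mapping step. It is a stand-in for the actual conversion tool (16), not its implementation: we decompose accented characters with unicodedata and replace anything that still lacks an ASCII mapping with a dash.

```python
# Minimal stand-in for the ASCII-mapping preprocessing step: decompose
# accented characters and dash out anything that has no ASCII mapping,
# e.g. smiley faces or calendar symbols.
import unicodedata

def to_ascii(text: str) -> str:
    decomposed = unicodedata.normalize('NFKD', text)
    out = []
    for ch in decomposed:
        if ord(ch) < 128:
            out.append(ch)
        elif not unicodedata.combining(ch):  # drop combining marks, dash the rest
            out.append('-')
    return ''.join(out)

print(to_ascii('Müller–Lyer ☺'))  # 'Muller-Lyer -'
```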

Some of the documents contain incorrect word boundaries, such as ‘mouseliverlysosomes’, which should have been written as ‘mouse liver lysosomes’. While the tokenization tool is otherwise satisfactory, it is incapable of correctly splitting these spans into tokens. We resolve this with an additional tokenization step using the known tokens from the corresponding full-text document. Specifically, we split the tokens using the spans of the longest-matching document tokens. To reduce the chance of mistakenly splitting correct tokens, we only re-tokenize tokens that belong to noun phrases. Finally, we re-apply POS tagging to complete the data preprocessing.
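A greedy longest-match segmentation along these lines is sketched below; the greedy strategy and the lowercased lookup are our reading of the description, not necessarily the exact implementation.

```python
# Hedged sketch of the document-guided re-tokenization: greedily split a
# run-together token using the longest known document token starting at
# each position; characters with no match are emitted individually.
def retokenize(token: str, known: set[str]) -> list[str]:
    parts, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):  # longest match first
            if token[i:j].lower() in known:
                parts.append(token[i:j])
                i = j
                break
        else:
            parts.append(token[i])  # no known token starts here
            i += 1
    return parts

known = {'mouse', 'liver', 'lysosomes'}
print(retokenize('mouseliverlysosomes', known))  # ['mouse', 'liver', 'lysosomes']
```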

Ontologies and controlled vocabularies

We prepare a set of controlled vocabularies and ontologies to assist named entity recognition and normalization. The list of concept names and ontologies we use includes ChEBI (19) and PubChem (20) (for Molecule), Entrez Gene (21) and UniProt (22) (for Protein), NCBI Taxonomy (23) (for Organism), Uberon (24) (for Tissue), Gene Ontology (25) (for Cellular) and Cellosaurus (http://web.expasy.org/cellosaurus/) and Cell Ontology (http://purl.obolibrary.org/obo/cl.owl) (for Cell).

We preprocess the lists by removing non-alphanumeric characters and lowercasing the symbols. Specifically for NCBI Taxonomy, we additionally expand the ontology by adding the commonly used abbreviations for scientific names. For binomial names at species rank, we abbreviate the genus while the rest of the name, such as the species epithet, variety, strain and substrain, remains the same. For example, ‘Escherichia coli O.1197’ is abbreviated as ‘E. coli O.1197’, ‘E coli O.1197’, ‘Es. coli O.1197’ and ‘Es coli O.1197’. This rule applies to all organisms except those in the Viruses and Viroids superkingdoms, since their scientific names do not usually follow binomial nomenclature but are instead of the form [Disease] virus (26). Acronyms are often used as abbreviated scientific names for viruses, for example ZYMV is the acronym of Zucchini yellow mosaic virus, and thus we also add acronyms to the ontology.
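The genus abbreviation expansion can be expressed compactly; the sketch below generates the four variants from the example above (one- and two-letter genus stems, with and without a period). It illustrates the rule only, not the full NCBI Taxonomy expansion.

```python
# Generate genus abbreviation variants for a binomial species name,
# keeping the epithet and any strain suffix unchanged.
def genus_abbreviations(name: str) -> list[str]:
    genus, rest = name.split(' ', 1)
    variants = []
    for stem in (genus[0], genus[:2]):   # one- and two-letter stems
        for sep in ('. ', ' '):          # with and without a period
            variants.append(f'{stem}{sep}{rest}')
    return variants

print(genus_abbreviations('Escherichia coli O.1197'))
# ['E. coli O.1197', 'E coli O.1197', 'Es. coli O.1197', 'Es coli O.1197']
```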

NER

We train our NER system on the training set using the NER toolkit NERsuite (http://nersuite.nlplab.org/) and optimize model hyperparameters to maximize performance on our development set. The tokens are labeled with the IOB scheme: B denotes the beginning of an entity, I the following tokens of the same entity and O tokens not part of any entity. For our original shared task submission, we trained a single CRF model capable of detecting all possible entity types and used the micro-averaged F1-score, derived from the official evaluation script, as the optimization metric. To achieve higher performance in NER, we directly provide NERsuite with dictionaries through the built-in dictionary-tagging module with no further preprocessing or normalization. We compare the performance of different dictionaries on the development data using default NERsuite hyperparameters. For the predictions on the test set, we merge the training and the development sets and re-train the CRF on this data using the best found hyperparameters.
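For illustration, the toy function below converts character-offset entity annotations into per-token IOB labels; it demonstrates the labeling scheme only and is not NERsuite's feature format.

```python
# Convert character-offset annotations into per-token IOB tags: a token
# gets B-/I-<type> if it lies inside an annotated span, O otherwise.
def iob_tags(tokens, entities):
    # tokens: [(start, end)], entities: [(start, end, type)]
    tags = []
    for t_start, t_end in tokens:
        tag = 'O'
        for e_start, e_end, e_type in entities:
            if t_start >= e_start and t_end <= e_end:
                tag = ('B-' if t_start == e_start else 'I-') + e_type
                break
        tags.append(tag)
    return tags

tokens = [(0, 6), (7, 13), (14, 19)]            # 'smooth muscle cells'
entities = [(0, 13, 'Tissue'), (14, 19, 'Cell')]  # toy annotations
print(iob_tags(tokens, entities))               # ['B-Tissue', 'I-Tissue', 'B-Cell']
```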

Since the shared task we have conducted further experiments by training separate models for each entity type. Following the same approach as in the single model training scheme, each entity model is trained and optimized individually, using the development data to evaluate model performance. Although a single model can benefit from mutually supporting information between entity types, the regularization hyperparameter is global, and selecting the other hyperparameters optimally for all entity types leads to a combinatorial explosion. The advantage of separate models is thus that the hyperparameters can be selected independently for each entity type. The predictions on the test data are the result of combining the predictions from all six models without any further post-processing.

Although we try to handle subword entities with a heuristic re-tokenization, there is no guarantee that this leads to more suitable tokens for the given task. For example, the word ‘yeast’ can in certain contexts end up being re-tokenized as ‘y’, ‘e’, ‘as’, ‘t’, with one to two characters in each token. To this end we also explore a purely character-based model (27, 28), which does not rely on the correctness of the tokenization. For this experiment we train a neural convolutional bidirectional long short-term memory conditional random field (CNN-BiLSTM-CRF) model, which reads the input sentences a single character at a time and also predicts the IOB tags for each character separately. This model follows the general principle of Ma and Hovy (29) but does not rely on word embeddings. Each character in a sentence is represented with a latent feature vector, i.e. an embedding, and the convolutional kernels are applied on a window of five consecutive characters. As the convolutional kernels are applied only on the immediate context of the given character, a bidirectional long short-term memory (LSTM) layer is utilized for analyzing the larger context and longer dependencies between the characters. A CRF layer is used for the final outputs for better modeling of the dependencies between the output labels.

In addition to the character information, we use the predictions from the original NERsuite model, converted to character level labels, as an additional input for the LSTM layer (Figure 2). This approach can thus be seen as an ensemble method in which the two models are stacked on top of each other. Providing this information is crucial, as the CNN-BiLSTM-CRF model is not given any word level information and is unable to achieve good performance based solely on the characters. The purpose of the neural model is thus not to learn the tagging task from scratch, but mainly to correct the predictions made by the NERsuite model.

Figure 2

Illustration of the tested character level neural NER model. The inputs are a character sequence (a sentence) and the corresponding IOB tags from the NERsuite model converted to character level. The example phrase demonstrates the tokenization issue: ‘Dis3lcompared’ is missing a word boundary and as a result is tagged as an entity as a whole by the NERsuite system. The neural model aims at detecting only the span ‘Dis3l’.
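A minimal PyTorch sketch of this stacked model is shown below. The layer sizes, the embedding of the NERsuite IOB input and the use of the third-party pytorch-crf package are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the CNN-BiLSTM-CRF ensemble: per-character embeddings pass
# through a width-5 convolution, are concatenated with an embedding of the
# NERsuite IOB prediction for each character, and feed a BiLSTM whose
# emissions are scored by a CRF layer.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed add-on package)

class CharNER(nn.Module):
    def __init__(self, n_chars, n_tags, char_dim=50, tag_dim=10, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.tag_emb = nn.Embedding(n_tags, tag_dim)  # NERsuite IOB input
        # window of five consecutive characters, as described in the text
        self.conv = nn.Conv1d(char_dim, char_dim, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(char_dim + tag_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def forward(self, chars, ner_tags):
        x = self.conv(self.char_emb(chars).transpose(1, 2)).transpose(1, 2)
        x = torch.cat([x, self.tag_emb(ner_tags)], dim=-1)
        h, _ = self.lstm(x)
        return self.emit(h)  # per-character CRF emission scores

    def loss(self, chars, ner_tags, gold):
        return -self.crf(self.forward(chars, ner_tags), gold)

    def decode(self, chars, ner_tags):
        return self.crf.decode(self.forward(chars, ner_tags))
```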

The benefit of such a model is not only that it does not depend on the tokens, but also that it is able to learn richer character level feature representations than NERsuite. This may improve generalizability on entity types such as Protein, whose mentions largely consist of short acronyms instead of full names and tend to follow certain patterns.

As the neural model is trained on the same training data as NERsuite, but requires the NERsuite predictions as features, we simply apply the NERsuite model to its own training data. This inevitably leads to overly optimistic performance, and the neural model learns to rely heavily on the NERsuite predictions. To mitigate this issue, the neural network inputs are regularized with the dropout method and the training is stopped once the performance on the development set is no longer improving. Due to this early stopping approach, the final model is not trained on the merged training and development sets, unlike NERsuite.

NEN and disambiguation

Our normalization approach is primarily based on a fuzzy string matching algorithm in which both entity and ontology terms are converted into vectors using character n-gram frequencies. Cosine similarity is then used for calculating the similarity between a detected entity and ontology terms. In this study, we use SimString (30), a library for approximate string matching, to retrieve the ontology terms with the highest cosine similarity to the queried entity. We apply the approximate string matching approach to all entity types except Protein, which we instead map to the corresponding identifiers with exact string matching.
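The idea behind this measure is sketched below with a brute-force implementation; the real system retrieves candidates efficiently with SimString, and the boundary padding is our own illustrative choice.

```python
# Fuzzy matching by character n-gram counts and cosine similarity: a toy
# re-implementation of the similarity measure, not of SimString itself.
import math
from collections import Counter

def ngrams(term: str, n: int = 3) -> Counter:
    padded = f'#{term.lower()}#'  # boundary padding (illustrative choice)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(mention: str, ontology: dict[str, str]) -> tuple[str, float]:
    v = ngrams(mention)
    return max(((tid, cosine(v, ngrams(name))) for name, tid in ontology.items()),
               key=lambda x: x[1])

ontology = {'lysosome': 'GO:0005764', 'ribosome': 'GO:0005840'}
print(best_match('lysosomes', ontology))  # ('GO:0005764', ...)
```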

Prior to the similarity comparison, the tagged entities produced by the NER system are preprocessed by substituting abbreviated names with full names using an abbreviation definition detector, Abbreviation Plus Pseudo-Precision (Ab3P) (31). In this study, we extend our abbreviation detection to the provided unannotated full texts, rather than limiting it to the annotated figure panel captions as in the previously submitted system. Subsequently, we preprocess the entities with the same approaches used on the dictionaries and ontologies, i.e. lowercasing all spans and removing punctuation, as described previously.

As some of the ontology terms cannot be uniquely linked to a single identifier but correspond to multiple ones, our system selects an identifier randomly for Cell, Cellular, Molecule and Tissue. However, for Molecule and Cell, which can be mapped to multiple ontologies, we follow the annotation guidelines when selecting among equivalent identifiers from different resources: for Molecule, ChEBI identifiers are preferred over PubChem, whereas for Cell, Cellosaurus identifiers are selected over Cell Ontology.

For most of the entity types, namely Cell, Molecule, Cellular and Tissue, each entity mention can be disambiguated and normalized independently. However, this approach is not applicable to Organism and Protein, as the normalization of each mention depends on the normalization of previous mentions. For Organism and Protein, we hence develop two separate rule-based systems to uniquely assign an identifier.

For Organism, it is common to use a systematic abbreviation, such as the genus name instead of the binomial name, to refer to the same species throughout the document. Subsequent mentions of those species, if abbreviated, should thus be normalized to an earlier mention of the corresponding binomial name. We use the taxonomy tree and the following disambiguation rules to assign a taxon identifier to Organism. These rules are applied sequentially, each only if the previous rule results in more than one identifier (a schematic implementation is sketched after the list):

  1. Take identifier with highest cosine similarity score and taxonomic rank under species, including subspecies, strain, variety and no rank.

  2. Take identifier of a previously mentioned Organism if abbreviations match.

  3. Take identifier of a previously mentioned Organism if acronyms match.

  4. Take identifier of a previously mentioned Organism of the same genus.

  5. Take identifier of a model organism of the same genus.

  6. Take identifier of the most studied organism in PubMed Central Open Access section.

  7. Take a random identifier.
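Schematically, the cascade can be written as a sequence of candidate filters, as below. The rule bodies are placeholders for the checks listed above, and the narrowing behavior is our interpretation of applying a rule only while more than one identifier remains.

```python
# Schematic Organism disambiguation cascade: each rule narrows the current
# candidate taxon ids given a context object; the cascade stops as soon as
# one candidate survives, with a random fallback (rule 7).
import random

def rule_rank_under_species(cands, ctx):
    kept = [c for c in cands if ctx['rank'][c] in
            {'species', 'subspecies', 'strain', 'variety', 'no rank'}]
    return kept or cands  # never empty the candidate set

def rule_same_genus_as_earlier_mention(cands, ctx):
    kept = [c for c in cands if ctx['genus'][c] in ctx['mentioned_genera']]
    return kept or cands

RULES = [rule_rank_under_species, rule_same_genus_as_earlier_mention]
# ... the abbreviation/acronym match, model organism and most-studied
# organism rules would follow the same pattern.

def disambiguate(candidates, ctx):
    for rule in RULES:
        if len(candidates) == 1:
            return candidates[0]
        candidates = rule(candidates, ctx)
    return random.choice(candidates)  # rule 7: random fallback
```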

Protein entities have the most ambiguous names, as the same protein names can be found in multiple organisms if they have the same function or shared sequence identity (32). Information about the Organism is therefore crucial for Protein entity normalization. We hence employ the results of our Organism normalization system and use the taxon identifiers to disambiguate Protein entities. However, multiple taxon identifiers can be recognized in a single document, so we adapt the rule-based system proposed by Wei et al. (9) to generate candidate taxon identifiers for the Protein. The list of candidate taxon identifiers is ordered according to the following rules (see the sketch after the list):

  1. Organism mentioned inside Protein text span,

  2. Organism mentioned before Protein within the same sentence,

  3. Organism mentioned after Protein within the same sentence,

  4. Organism mentioned in the previous caption and

  5. Organism mentioned in the same document.
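A possible rendering of this ordering is sketched below: each Organism mention carries its positional relation to the Protein mention, and candidates are sorted by the rule index. The relation labels are our own names for the five cases above.

```python
# Order candidate taxon ids by the five-step priority list, deduplicating
# while keeping the highest-priority occurrence of each taxon.
PRIORITY = ['inside_span', 'before_in_sentence', 'after_in_sentence',
            'previous_caption', 'same_document']

def order_taxa(organism_mentions: list[tuple[str, str]]) -> list[str]:
    # organism_mentions: [(taxon_id, relation)], relation in PRIORITY
    ordered = sorted(organism_mentions, key=lambda m: PRIORITY.index(m[1]))
    seen, result = set(), []
    for taxon, _ in ordered:
        if taxon not in seen:
            seen.add(taxon)
            result.append(taxon)
    return result

print(order_taxa([('9606', 'same_document'), ('10090', 'same_document')]))
# ['9606', '10090']  (stable sort keeps document order within a rule)
```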

In addition, we perform query expansion to generate candidate Protein names covering potential UniProt and Entrez Gene symbol variations by using a stripping algorithm (33). The algorithm recursively removes common words, such as protein, gene, RNA and Organism names, from Protein mentions to produce canonical forms, i.e. minimal symbols that match gene symbols in the Entrez Gene database. For instance, ‘p53 protein’ results in ‘p53’. The canonical forms are subsequently lowercased and stripped of punctuation. The list of candidate Protein names is then ordered by string length.
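A simplified version of this expansion is sketched below; the stop word list is an assumption for illustration, and the real algorithm (33) additionally validates the stripped forms against Entrez Gene symbols.

```python
# Query expansion by stripping: recursively remove common words from the
# mention, collecting every intermediate form as a candidate, then order
# the candidates by length (longest first).
STOP = {'protein', 'gene', 'rna', 'human', 'mouse'}  # assumed word list

def expand(mention: str) -> list[str]:
    candidates, words = {mention}, mention.split()
    stripped = [w for w in words if w.lower() not in STOP]
    if stripped and stripped != words:
        candidates.update(expand(' '.join(stripped)))
    return sorted(candidates, key=len, reverse=True)

print(expand('p53 protein'))  # ['p53 protein', 'p53']
```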

For each taxon identifier in the ordered list of candidate organisms, we use exact string matching to retrieve the corresponding Protein identifier. The search starts with the longest candidate Protein name and stops when an identifier is found. If multiple identifiers are found for the current organism, a random one is selected; if no identifier is found, the search continues with the subsequent organism. For example, we first obtain the list of organisms for the mention of the protein ZEB1 in the sentence ‘Pearson correlation between ZEB1 and MITF mRNA expression in 61 melanoma cell lines available through the CCLE.’ As there are no mentions of other organisms within the sentence or the previous caption, only human (NCBI Taxonomy:9606) and mouse (NCBI Taxonomy:10090) are found in the document. Starting from human, we map ZEB1, the longest candidate gene name, to the human gene identifier (NCBI Gene:6935), and the search stops as an identifier has been found for ZEB1.

Results and discussion

The results presented in this section are based on the official evaluation scripts provided by the BioCreative shared task organizers. The NER task is evaluated on strict entity span matching, i.e. the character offsets have to be identical to those of the gold standard annotations. For the normalization task, only the normalized IDs returned by the systems are evaluated. The performance of the systems is reported as micro-averaged precision, recall and F-score at corpus level.
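For concreteness, the strict matching criterion and the micro-averaged scores can be computed as follows; this is a toy re-implementation, not the official script.

```python
# Strict span matching: an entity counts as correct only if document,
# character offsets and type all match a gold annotation exactly.
# Precision, recall and F-score are micro-averaged over all entities.
def micro_prf(gold: set, pred: set) -> tuple[float, float, float]:
    # entities as (doc_id, start, end, type) tuples
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {('d1', 0, 5, 'Organism'), ('d1', 6, 11, 'Tissue')}
pred = {('d1', 0, 5, 'Organism'), ('d1', 6, 13, 'Tissue')}  # offset mismatch
print(micro_prf(gold, pred))  # (0.5, 0.5, 0.5)
```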

NER

Incorrect word boundaries can result in multiple types of entity annotations for a given token. For example, ‘mouseskinfibroblasts’ contains annotations for Organism, Tissue and Cell. Since we train a single CRF-based model to recognize all types of entities, one token representing multiple entities causes a loss of training examples, as NERsuite does not support multilabel classification. As mentioned in the Methods section, we resolve this issue by re-tokenizing the tokens using known tokens from the provided full-text document. The recovery of training examples is significant: tokenization with NERsuite alone yields roughly 97% of the annotations, while this step increases the number of annotations by an additional 2 pp, equivalent to more than 2000 annotations. As a result, we recover more than 99% of the original annotations, with the Organism entity showing the highest increase in coverage (Table 1).

Table 1

Comparison of annotation counts between tokenization approaches

Re-tokenization    Protein   Cellular   Tissue    Molecule   Cell      Organism
Without            97.178    99.772     95.951    96.107     97.099    93.691
With               99.187    99.866     99.842    99.424     99.559    98.921

Comparison of annotation counts between preprocessing with only the NERsuite tokenization module (without) and with both NERsuite tokenization and the additional tokenization step (with). The numbers are the percentages of annotations retained relative to the provided data, shown for each entity type.


While training a single model for all types of entities offers relatively good performance, the model is tuned toward predicting Protein, the entity type with the highest frequency in the training data. As a result, the performance of the model on other entities, such as Cellular, Molecule and Tissue, is lower than the overall performance. We resolve this issue by training NER models to detect each entity type individually. The performance of these two training schemes for each entity type is shown in Table 2.

Table 2

Comparison of NER system on the development data

Entity     Combined entity model    Independent entity model    CNN-BiLSTM-CRF model
Cell       0.796 / 0.698 / 0.744    0.796 / 0.714 / 0.752       0.803 / 0.639 / 0.712
Cellular   0.759 / 0.611 / 0.677    0.710 / 0.682 / 0.696       0.725 / 0.633 / 0.676
Protein    0.771 / 0.726 / 0.748    0.755 / 0.738 / 0.746       0.833 / 0.779 / 0.805
Organism   0.878 / 0.696 / 0.776    0.872 / 0.757 / 0.810       0.856 / 0.713 / 0.778
Molecule   0.825 / 0.579 / 0.681    0.724 / 0.653 / 0.687       0.740 / 0.595 / 0.659
Tissue     0.816 / 0.566 / 0.668    0.750 / 0.696 / 0.722       0.730 / 0.607 / 0.663
All        0.788 / 0.686 / 0.734    0.761 / 0.721 / 0.741       0.809 / 0.718 / 0.761

Evaluation of the original model submitted to the shared task (combined), the improved CRF model (independent) and the neural character level model (CNN-BiLSTM-CRF) based on the official evaluation script with strict entity span matching on the development data. Numbers within cells are precision/recall/F-score.


As shown in Table 2, the independent entity models yield better F-scores for all entity types except Protein, by increasing recall while lowering precision. This is due to the fact that the best performing hyperparameters for each entity type are selected independently. In the single model approach, the same hyperparameter values are used for all entity types, which results in them being dictated by the most common entity type, Protein. This can be seen during the optimization, as the hyperparameters of the independent model for tagging Protein and of the combined model are exactly the same. These parameters are thus suboptimal for tagging the other entity types. While training a CRF-based model for multiple entity types can yield better system performance, as the model can rely on dependencies between certain entity types, this is not the case in our experiments.

Even though optimizing the hyperparameters separately carries a risk of overfitting on the development set, this does not seem to be the case in our experiments, as training the independent models improves the performance of the system by 0.7 pp and 0.5 pp in F-score on the development and test sets, respectively (Tables 2 and 3). The difference is most apparent on the Cell, Cellular and Tissue entities, with improvements of 1.5, 2.0 and 3.2 pp in test set F-scores, respectively. Since these entities are less common than Protein, the influence on the overall score is not as pronounced.

Table 3

Official evaluation of NER system on the test data

Entity     Combined entity model    Independent entity model    CNN-BiLSTM-CRF model
Cell       0.783 / 0.708 / 0.743    0.767 / 0.749 / 0.758       0.769 / 0.641 / 0.699
Cellular   0.673 / 0.508 / 0.579    0.630 / 0.571 / 0.599       0.634 / 0.495 / 0.556
Protein    0.729 / 0.739 / 0.734    0.728 / 0.745 / 0.736       0.764 / 0.768 / 0.766
Organism   0.860 / 0.809 / 0.834    0.823 / 0.852 / 0.837       0.789 / 0.771 / 0.780
Molecule   0.775 / 0.587 / 0.668    0.661 / 0.681 / 0.671       0.667 / 0.595 / 0.629
Tissue     0.727 / 0.575 / 0.642    0.650 / 0.700 / 0.674       0.646 / 0.622 / 0.634
All        0.747 / 0.694 / 0.720    0.719 / 0.730 / 0.725       0.739 / 0.702 / 0.720

Evaluation of the original model submitted to the shared task (combined), improved CRF model (independent) and the neural character level model (CNN-BiLSTM-CRF) based on the official evaluation script with strict entity span matching on the test set. Numbers within cells are precision/recall/F-score.


Table 4

Official evaluation of NEN system on the development data

Entity     Our system (submitted to Bio-ID task)   Our system (this work)
Cell       0.733 / 0.770 / 0.751                   0.715 / 0.766 / 0.740
Cellular   0.478 / 0.493 / 0.485                   0.462 / 0.491 / 0.476
Protein    0.445 / 0.315 / 0.369                   0.410 / 0.360 / 0.383
Organism   0.724 / 0.669 / 0.695                   0.802 / 0.668 / 0.729
Molecule   0.292 / 0.187 / 0.228                   0.595 / 0.587 / 0.591
Tissue     0.598 / 0.672 / 0.633                   0.592 / 0.680 / 0.633

Comparison of our normalization systems on development data with gold standard entities. Numbers within table cells are precision/recall/F-score.


With the neural approach, a significant improvement of 2.0 pp over the underlying system can be seen on the development set. This improvement is caused solely by increased precision, which is intuitive as the purpose of the model is mostly to correct existing predictions rather than to detect new ones. Unfortunately, these promising results translate to a decrease of 0.5 pp on the test set compared to the NERsuite-based model. We have not done an exhaustive search over the neural network architectures or hyperparameters, but mostly follow decisions made in previous studies. We believe that the overfitting on the development data is caused by the early stopping procedure and could be alleviated by slightly increasing the development set at the expense of the training set.

NEN and disambiguation

The performance of our normalization system is heavily dependent on the NER system performance since unrecognized and incorrect entity spans are automatically classified as false negatives and false positives, respectively. We thus evaluate our normalization system on the development set based on the gold standard entity mentions to compare the different approaches on different entity types.

As shown in Table 4, our normalization system submitted to the Bio-ID task performs moderately on Cell, Cellular, Organism and Tissue, where the F-score ranges from 0.485 to 0.751; however, the performance drops dramatically when evaluated on Molecule and Protein. In this study, we thus focus on improving the system for Molecule and Protein normalization.

For Molecule, we have improved our dictionary coverage and changed the rule to prefer assigning ChEBI identifiers to the entity spans, as mentioned in the Methods section. The latter change has the most significant impact on the system performance, increasing the F-score by more than 25 pp.

For Protein, normalization is also improved, although less markedly, with only a 1 pp increase in F-score. Unlike Molecule, our Protein normalization system primarily depends on the accuracy of both the exact string matching and the Organism normalization. As the former component remains unchanged, the improvement is determined solely by the latter, the Organism normalization. This influence can also be seen by using gold standard Organism mentions and identifiers: the precision, recall and F-score of Protein normalization increase to 0.451, 0.397 and 0.422, respectively. This overall 4 pp increase in F-score on Protein normalization demonstrates that correctly normalizing the Organism plays an important but limited role in our current Protein normalization system. Significant gains should thus be expected from improving the Protein normalization system itself.

For Cell, Cellular and Tissue, the performance of the system remains unchanged or drops slightly from our submission result. We suspect this is due to the lack of disambiguation rules when multiple matching identifiers are found: the observed variation is then likely just fluctuation in the accuracy of randomly selected identifiers.

We finally combine our normalization system with the newly developed NER systems and evaluate their combined performance on the test data set. The performance of the current systems is compared against our previously submitted predictions and the results of the best performing systems.

As shown in Table 5, both of our systems developed in this work perform comparably to our submitted system for all entity types except Molecule, Cell and Organism. For Organism and Molecule, the heightened performance can be attributed to the positive effects of both the NER and the NEN systems. For Cell, however, the improvement in normalization score can only be explained by the improvement in the NER system, as our current normalization introduces no further improvement and in fact lowers the F-score for this entity type when evaluated on gold standard entities. A mere 1 pp increase in NER F-score for Cell can subsequently translate into an over 5 pp F-score improvement of the integrated system.

Table 5

Official evaluation of NER and NEN systems on the test data

Entity     CRF-based combined entity model   CRF-based independent entity model   CNN-BiLSTM-CRF model     Best performing system    Reference
Cell       0.600 / 0.576 / 0.588             0.630 / 0.664 / 0.647                0.674 / 0.610 / 0.641    0.784 / 0.557 / 0.651     Sheng et al. (13)
Cellular   0.456 / 0.371 / 0.410             0.404 / 0.423 / 0.413                0.391 / 0.376 / 0.383    0.550 / 0.450 / 0.495     Sheng et al. (13)
Protein    0.472 / 0.343 / 0.397             0.456 / 0.358 / 0.401                0.445 / 0.388 / 0.415    0.472 / 0.343 / 0.397     Our submitted system
Organism   0.668 / 0.667 / 0.667             0.753 / 0.725 / 0.739                0.761 / 0.703 / 0.731    0.660 / 0.883 / 0.756     Singh and Dai (34)
Molecule   0.244 / 0.240 / 0.242             0.439 / 0.489 / 0.462                0.460 / 0.456 / 0.458    0.587 / 0.473 / 0.524     Sheng et al. (13)
Tissue     0.531 / 0.490 / 0.510             0.427 / 0.565 / 0.486                0.451 / 0.542 / 0.493    0.531 / 0.490 / 0.510     Our submitted system

Comparison of our joint named entity recognition and normalization systems and the best performing systems in the shared task on the official test set. Numbers within table cells are micro-averaged precision/recall/F-score.


While the increase in NER system performance can be valuable for some entities, the effect of NER on normalization performance can also be detrimental. As shown in Table 3, our NER systems, despite a slight increase in F-score on Tissue recognition, have a negative impact on normalization, lowering the F-score by 1–2 pp. One potential explanation is that the improved NER system also finds seemingly correct entities which are nevertheless not considered correct according to the annotation guidelines. For example, in the phrase ‘smooth muscle basement membranes’, our system recognizes ‘smooth muscle’ as a Tissue entity and the normalization model is also able to find a corresponding identifier for it, but this is not considered an entity in the gold standard annotations, as it is seen as a modifier of the Cellular entity ‘basement membranes’. The combined NER model has plausibly learned this type of dependency between entity types and avoids many of these errors, whereas the independent entity type-specific NER models are not aware of the surrounding entities. It would thus be beneficial to take the normalization performance into account while optimizing the NER system, as recognized entities with missing or incorrect identifiers may be harmful for real applications relying on the extracted information.

Our integrated system has moderate performance overall. Whereas the NER component achieves high F-scores compared to the other systems submitted to the shared task, the normalization performance still lags behind for most entity types. While fuzzy string matching gives good results for some of the entities, the outcome varies considerably across entity types, as shown by the large spread in F-scores (>33 pp). This signifies that the approach is not universally good for all types of entities, and other approaches, such as TF-IDF weighting and additional preprocessing and post-processing steps, should also be considered.

Conclusions and future work

We approach the BioCreative Bio-ID task by training CRF-based models to recognize biomedical entities and by linking them to their corresponding database identifiers using an approximate string matching algorithm. For Protein and Organism entities, we utilize the ontology structure and the surrounding context to disambiguate entities with multiple identifier candidates.

Our CRF-based NER systems demonstrate notable performance overall, achieving the best score of all systems submitted to the Bio-ID task for all entity types and exceeding the performance of the second best systems by 8 to 18 percentage points depending on the entity type. In this extended study we have further improved the system with a more fine-grained hyperparameter optimization specific to each entity type. This approach leads to a significant improvement in performance for the less common entity types without sacrificing the overall performance.

We have also explored the possibility of correcting the predictions with a character level neural model stacked on top of the CRF model. The results suggest that such a model can potentially offer considerable performance improvements, yet it overfits easily to the development set. As future work we will look into better ways of regularizing the network, as well as consider purely character level modeling, dismissing the ensemble approach.

Our NEN system submitted to the shared task demonstrated lagging performance for Protein, Cellular and Molecule when compared with the other entities. In particular, for Cellular entities the best performing systems are able to achieve up to 11 pp higher F-scores in the official evaluation. In this work, we have improved our system on all entity types by improving the abbreviation resolution. For Protein normalization, even though the performance of the system is slightly increased by the Organism assignment, we suspect that the strict string matching criteria might be an important factor limiting the system performance.

Our current normalization system is somewhat limited, as it applies several manually crafted rules which do not generalize to other entity types, hindering the application of the same approach to entity types outside the scope of the BioCreative Bio-ID task. Developing a machine learning system that can be trained on annotations for new entity types would thus be an ideal solution for the normalization task. Since the conventions for naming biomedical entities, as well as the dependence on the surrounding context, vary among entity types, building a unified normalization system remains a challenging task.

Acknowledgements

Computational resources are provided by CSC-IT Center For Science Ltd, Espoo, Finland.

Funding

ATT Tieto käyttöön grant.

Conflict of interest. None declared.

Database URL: https://github.com/TurkuNLP/BioCreativeVI_BioID_assignment

References

1. Delėger, L., Bossy, R., Chaix, E. et al. (2016) Overview of the bacteria biotope task at BioNLP shared task 2016. In: Proceedings of the 4th BioNLP Shared Task Workshop, Berlin, Germany, 13 August 2016. Association for Computational Linguistics, pp. 12–22.

2. Kim, J., Ohta, T., Tsuruoka, Y. et al. (2004) Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, University of Geneva, Switzerland, 28–29 August 2004. Association for Computational Linguistics, pp. 70–75.

3. Arighi, C.N., Hirschman, L., Thomas, L. et al. (2017) Bio-ID track overview. In: Proceedings of the BioCreative VI Workshop, Bethesda, MD, USA, pp. 28–31.

4. Ding, R., Arighi, C.N., Lee, J. et al. (2015) pGenN, a gene normalization tool for plant genes and proteins in scientific literature. PLoS One, 10, e0135305, 1–23.

5. Habibi, M., Weber, L., Neves, M. et al. (2017) Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33, i37–i48.

6. Kaewphan, S., Van Landeghem, S., Ohta, T. et al. (2015) Cell line name recognition in support of the identification of synthetic lethality in cancer from text. Bioinformatics, 32, 276–282.

7. Pyysalo, S. and Ananiadou, S. (2013) Anatomical entity mention recognition at literature scale. Bioinformatics, 30, 868–875.

8. Mehryary, F., Hakala, K., Kaewphan, S. et al. (2017) End-to-end system for bacteria habitat extraction. In: Proceedings of the 16th BioNLP Workshop, Vancouver, Canada, 4 August 2017. Association for Computational Linguistics, pp. 80–90.

9. Wei, C.H., Kao, H.Y. and Lu, Z. (2015) GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. Biomed Res. Int., 918710, 1–7.

10. Li, H., Chen, Q., Tang, B. et al. (2017) CNN-based ranking for biomedical entity normalization. BMC Bioinf., 18, 385.

11. Limsopatham, N. and Collier, N.H. (2016) Normalising medical concepts in social media texts by learning semantic representation. In: Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2016), Osaka, Japan, 12 December 2016. The COLING 2016 Organizing Committee, pp. 10–19.

12. Leaman, R. and Lu, Z. (2016) TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics, 32, 2839–2846.

13. Sheng, E., Miller, S., Ambite, J.S. et al. (2017) A neural named entity recognition approach to biological entity identification. In: Proceedings of the BioCreative VI Workshop, Bethesda, MD, USA, pp. 24–27.

14. Dai, H.J. and Singh, O. (2018) SPRENO: a BioC module for identifying organism terms in figure captions. Database, 2018, bay048, 1–12.

15. Kaewphan, S., Mehryary, F., Hakala, K. et al. (2017) TurkuNLP entry for interactive Bio-ID assignment. In: Proceedings of the BioCreative VI Workshop, Bethesda, MD, USA, pp. 32–35.

16. Pyysalo, S., Ginter, F., Moen, H. et al. (2013) Distributional semantics resources for biomedical text processing. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, 12–13 December 2013, pp. 39–44.

17. Saetre, R., Yoshida, K., Yakushiji, A. et al. (2007) AKANE system: protein-protein interaction pairs in BioCreAtIvE2 challenge, PPI-IPS subtask. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop, Madrid, Spain, 23–25 April 2007, pp. 209–211.

18. Tsuruoka, Y., Tateishi, Y., Kim, J.D. et al. (2005) Developing a robust part-of-speech tagger for biomedical text. In: Bozanis, P. and Houstis, E.N. (eds.) Advances in Informatics: 10th Panhellenic Conference on Informatics, PCI 2005, Volos, Greece, 11–13 November 2005. LNCS, Vol. 3746. Springer, pp. 382–392.

19. Degtyarenko, K., De Matos, P., Ennis, M. et al. (2007) ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res., 36, D344–D350.

20. Bolton, E.E., Wang, Y., Thiessen, P.A. et al. (2008) PubChem: integrated platform of small molecules and biological activities. Annu. Rep. Comput. Chem., 4, 217–241.

21. Brown, G.R., Hem, V., Katz, K.S. et al. (2014) Gene: a gene-centered information resource at NCBI. Nucleic Acids Res., 43, D36–D42.

22. UniProt Consortium (2014) UniProt: a hub for protein information. Nucleic Acids Res., 43, D204–D212.

23. Federhen, S. (2011) The NCBI taxonomy database. Nucleic Acids Res., 40, D136–D143.

24. Mungall, C.J., Torniai, C., Gkoutos, G.V. et al. (2012) Uberon, an integrative multi-species anatomy ontology. Genome Biol., 13, R5.

25. Gene Ontology Consortium (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32, D258–D261.

26. Fauquet, C.M. and Pringle, C.R. (1999) Abbreviations for invertebrate virus species names. Arch. Virol., 144, 2265–2271.

27. Klein, D., Smarr, J., Nguyen, H. et al. (2003) Named entity recognition with character-level models. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, Edmonton, Canada, 31 May 2003. Association for Computational Linguistics.

28. Kuru, O., Can, O.A. and Yuret, D. (2016) CharNER: character-level named entity recognition. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016. The COLING 2016 Organizing Committee.

29. Ma, X. and Hovy, E. (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016. Association for Computational Linguistics, pp. 1064–1074.

30. Okazaki, N. and Tsujii, J. (2010) Simple and efficient algorithm for approximate dictionary matching. In: Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, pp. 851–859.

31. Sohn, S., Comeau, D.C., Kim, W. et al. (2008) Abbreviation definition identification based on automatic precision estimates. BMC Bioinf., 9, 402.

32. Chen, L., Liu, H. and Friedman, C. (2005) Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics, 21, 248–256.

33. Van Landeghem, S., Ginter, F., Van de Peer, Y. et al. (2011) EVEX: a PubMed-scale resource for homology-based generalization of text mining predictions. In: Proceedings of the 2011 Workshop on Biomedical Natural Language Processing, Portland, Oregon, USA, 23–24 June 2011. Association for Computational Linguistics, pp. 28–37.

34. Singh, O. and Dai, H.J. (2017) SPRENO: a BioC module for recognizing and normalizing species and their model organisms. In: Proceedings of the BioCreative VI Workshop, Bethesda, MD, USA, pp. 28–31.

Author notes

These authors contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.