A probabilistic automated tagger to identify human-related publications Open Access

Forward selection process results, showing best performing feature included at each stage using 5 × 2 cross-validation on the training data set

Stage	Feature	AUC	MCC
1	Abstract Bigrams	0.955	0.771
2	Abstract Unigrams	0.967	0.813
3	Journal Name	0.969	0.823
4	Title Unigrams	0.972	0.831
5	Title Bigrams	0.973	0.834
6	Abstract Trigrams	0.974	0.837

Stage	Feature	AUC	MCC
1	Abstract Bigrams	0.955	0.771
2	Abstract Unigrams	0.967	0.813
3	Journal Name	0.969	0.823
4	Title Unigrams	0.972	0.831
5	Title Bigrams	0.973	0.834
6	Abstract Trigrams	0.974	0.837

Table 1

Forward selection process results, showing best performing feature included at each stage using 5 × 2 cross-validation on the training data set

Stage	Feature	AUC	MCC
1	Abstract Bigrams	0.955	0.771
2	Abstract Unigrams	0.967	0.813
3	Journal Name	0.969	0.823
4	Title Unigrams	0.972	0.831
5	Title Bigrams	0.973	0.834
6	Abstract Trigrams	0.974	0.837

Stage	Feature	AUC	MCC
1	Abstract Bigrams	0.955	0.771
2	Abstract Unigrams	0.967	0.813
3	Journal Name	0.969	0.823
4	Title Unigrams	0.972	0.831
5	Title Bigrams	0.973	0.834
6	Abstract Trigrams	0.974	0.837

Table 2

Comparison of performance results predicted by cross-validation and actual results predicted on the test data set

Dataset	AUC	MCC	F1	Recall	Precision	Brier Score	Error Rate
Training	0.975	0.841	0.944	0.940	0.949	0.059	0.073
Test	0.976	0.833	0.950	0.946	0.955	0.056	0.070

Dataset	AUC	MCC	F1	Recall	Precision	Brier Score	Error Rate
Training	0.975	0.841	0.944	0.940	0.949	0.059	0.073
Test	0.976	0.833	0.950	0.946	0.955	0.056	0.070

Table 2

Open in new tab Download slide

Comparison of performance results predicted by cross-validation and actual results predicted on the test data set

Dataset	AUC	MCC	F1	Recall	Precision	Brier Score	Error Rate
Training	0.975	0.841	0.944	0.940	0.949	0.059	0.073
Test	0.976	0.833	0.950	0.946	0.955	0.056	0.070

Dataset	AUC	MCC	F1	Recall	Precision	Brier Score	Error Rate
Training	0.975	0.841	0.944	0.940	0.949	0.059	0.073
Test	0.976	0.833	0.950	0.946	0.955	0.056	0.070

The new features investigated (word count, punctuation symbol count, title numeric term count and page count) were not found to improve the final predictive model. However, even without these new features providing additional value, the model is highly accurate and demonstrates that the approach used to create the prior RCT (Randomized Controlled Trial) tagger also works well for the Humans tag assignment.

Overall the performance figures on the test data set correlate very closely with the cross-validation estimates. Calibration of the model to the actual Humans MeSH assignment was very good, as shown in Figure 1. Figure 1 is a calibration plot of the probability predictions on the test data set, as compared to the overall proportion of MeSH Humans assignments for articles at each level of predicted probability. The plot shows that the model slightly over-predicts the proportion for probability values between 0.10 and 0.50 and slightly under-predicts the proportion for probability scores between 0.50 and 0.90. Calibration is almost perfect at the extremes. There appears to be no overall bias in the calibration as the under and over predictions are about the same. The adjusted R² statistic between the predictions and the actual MeSH Human tag proportions is 0.982.

Figure 1

Probabilistic tagger confidence score calibration plot. The x-axis represents the predicted probability score, and the y-axis shows the proportion of articles within a similar probability score range that were assigned the Humans MeSH term. Numbers next to the dots show the number of samples included in the probability score range used to calculate the MeSH Humans proportion. The dotted line x = y shows perfect calibration for comparison.

The distribution of the model probability estimates for the human tagger for the test set articles with and without the Humans MeSH term is shown in Figures 2 and 3, respectively. Figure 2 shows that the vast majority of articles that are assigned the MeSH Humans term are scored very highly by the tagger, typically >0.95, with extremely few of these articles scored <0.50. Figure 3 shows that the vast majority of articles that are NOT assigned the MeSH Humans term are scored very low by the tagger, typically <0.10, with a monotonically decreasing amount assigned between 0.10 and 0.90. Interestingly, there is a small increase in negative MeSH Humans articles scored >0.90. Looking closely at Figure 2, there is also a very small bump on positive MeSH articles scoring <0.10.

Figure 2

Probabilistic tagger predicted probability score distribution over articles in the test set, consisting of articles published in 2015–2016 and assigned the Humans MeSH term. Shows the distribution of the probability estimates of these articles as predicted by our model versus the percentage of articles in the test set assigned the MeSH Humans term.

Open in new tab Download slide

Figure 3

Probabilistic tagger predicted probability score distribution over articles in the test set, consisting of articles published in 2015–2016 and NOT assigned the Humans MeSH term. Shows the distribution of the probability estimates of these articles as predicted by our model versus the percentage of articles in the test set NOT assigned the MeSH Humans term.

Open in new tab Download slide

The ‘extreme disagreements’ between the tagger and the assigned MeSH tags were manually reviewed. One hundred articles were randomly chosen in which articles lacking Humans MeSH were scored >0.99 by our model; these samples represent the ‘bump’ in the histogram near 1.0 in Figure 3. In addition, 100 articles, which received Humans MeSH indexing but to which our model gave scores <0.01, were randomly chosen. These articles represent the small ‘bump’ in the histogram near 0.0 in Figure 2. When the model gave low predictive scores (<0.01), the manual reviewer agreed with the model 97% of the time. On the other hand, when the tagger predicted high scores (>0.99) in articles lacking Humans MeSH, the manual reviewer only agreed with the model 50% of the time. Overall, manual review agreed with the model in 73.5% of cases. Therefore, according to the manual review, when the model gives an article a low predictive score, the article is almost certainly not about humans. However, in some cases a high score will be assigned to articles that are in fact not about humans. This is consistent with the probabilistic interpretation of the tag. See Table 3.

Table 3

Comparison of manual review for cases of extreme disagreement between the MEDLINE assigned Humans MeSH term and the model’s predictive probability scores. One hundred cases of extreme prediction disagreement were selected randomly from articles with the MEDLINE Humans assignment but predictive tagger probabilities <0.01, and another 100 cases lacking the MEDLINE Humans term but having predictive tagger probabilities >0.99

	Manual review
Disagreement type	Humans	Not Humans	Uncertain	Totals
Humans MeSH term assigned, tagger probability score < 0.01	2	97	1	100
Humans MeSH term not assigned, tagger probability score > 0.99	50	41	9	100
Totals	52	138	10	200

	Manual review
Disagreement type	Humans	Not Humans	Uncertain	Totals
Humans MeSH term assigned, tagger probability score < 0.01	2	97	1	100
Humans MeSH term not assigned, tagger probability score > 0.99	50	41	9	100
Totals	52	138	10	200

Table 3

	Manual review
Disagreement type	Humans	Not Humans	Uncertain	Totals
Humans MeSH term assigned, tagger probability score < 0.01	2	97	1	100
Humans MeSH term not assigned, tagger probability score > 0.99	50	41	9	100
Totals	52	138	10	200

	Manual review
Disagreement type	Humans	Not Humans	Uncertain	Totals
Humans MeSH term assigned, tagger probability score < 0.01	2	97	1	100
Humans MeSH term not assigned, tagger probability score > 0.99	50	41	9	100
Totals	52	138	10	200

Discussion

The probabilistic automated tagger performs with very high accuracy and calibration and gives similar results as that of MEDLINE curators overall. The high frequency and heterogeneous nature of human-related articles proved not to be a substantial problem for our machine-learning method. The differences between estimated cross-validation performance on the training set and performance on the held out test set were small and approximately evenly distributed in direction. Article metadata features alone were enough to reach high performance.

The majority of the tagger scores are quite binary, either <0.05 or >0.95. Still, a substantial fraction of articles fall into the middle range. Almost 50% of articles in the test set that do not have the Humans MeSH term score between 0.05 and 0.95. For articles that do have the Humans MeSH term assigned, the proportion of articles in the middle is less, but still notably ∼20%. These did appear to represent cases that were borderline for some reason (e.g. a review of animal models of human disease with relevance for potential treatments in man). It is important to provide the user with customizable tools in order to handle these articles in a manner appropriate for their specific use case. No binary tag assignment tool can offer a similar level of flexibility.

As a simple example, consider a researcher looking for narrative articles about humans in PubMed. One publication type of interest would be Autobiographies, which logically should also have the Humans MeSH term. A recent PubMed search found 3399 articles indexed as Autobiography. Of these, only 1376 also have the Humans MeSH term. The Autobiographies lacking the Humans MeSH term include seemingly obvious human-related articles such as ‘An interview with Claudio Stern’ (12), ‘Laura Frontali-my life with yeast’ (13) and ‘Autobiography of J. Andrew McCammon’ (14). These articles are scored 0.31, 0.19 and 0.40, respectively, with the probabilistic Human tagger. Perhaps an interview is less about Humans than about a specific human. Also, perhaps a ‘life with yeast’ may be more about yeast than about humans. Certainly there is some gray area and perhaps inconsistency about what constitutes a human article. A user searching for Autobiographies and limiting the search to articles having the Humans MeSH term would miss half the autobiographies. Removing the Humans MeSH term requirement would include 47 articles tagged with the MeSH term Bacteria and not the MeSH term Humans. The user would have no fine-grained controllable search options to address this problem. With a probabilistic tagger, a threshold of 0.10 would pick up all three of these example biographies as human articles. For a user requiring a much stricter definition, perhaps Autobiographies that are about the personal lives of the human beings instead of their work, a threshold of 0.50 would exclude all of the three example biographies. This underscores the need for flexible tools, customizable to different use cases.

We examined cases of extreme disagreement between the MEDLINE MeSH assignment and the model’s predictive scores. An independent blinded human expert found that when the model predicted NON-HUMAN, the article was almost always NON-HUMAN. However, when the MeSH term was not assigned but the model predicted HUMAN with high confidence, the expert agreed with the model only in half the articles. This suggests that a user who employs the model to retrieve human-related articles may safely discard articles having predictive scores below 0.01. The probabilistic nature of the tag along with the good level of calibration ensures that there will be a controllable proportion of false positives and false negatives at any chosen probability threshold. Since the threshold is customizable by the user for their specific purposes, the impact of these false positives and false negatives on workload should be small.

While the main results for the Humans tagger presented here are the probabilistic tagger evaluation measures of AUC and Brier score, the binary outcome measures are also highly accurate and compare favorably with prior binary label prediction work. The F1 measure on the test set here is 0.950. Yepes reported a maximum F1 of 0.9337 for the MeSH Humans term (8).

The Human probabilistic tagger is being used by our team to assign predictive scores to all articles indexed in PubMed, including newly published articles, and have been made public for download on our project website (http://arrowsmith.psych.uic.edu). As well, the model is used to assign predictive scores to articles that are retrieved through Metta (15), which carries out a unified high recall, de-duplicated retrieval of records not only from PubMed but also from EMBASE, CINAHL Plus, PsycINFO and the Cochrane Central Register of Controlled Trials. The Humans probabilistic tag will be supplemented by RCT predictive scores (2) and other automated publication type, study design and attribute taggers that are currently under development.

Conclusion

The predictive model described here was highly accurate as evaluated by both a large-scale comparison with MEDLINE as well as manual expert review, achieving accuracy comparable to that of MeSH indexing itself. We have tagged with our predictive scores all articles in PubMed from 1987 through 2017 and are tagging newly published articles weekly as they appear. Using our automated tagging approach, most of these new articles will be tagged by our Humans probabilistic model prior to review for annotation by the MEDLINE indexers. The current database of articles tagged with the Humans probabilistic model is available at http://arrowsmith.psych.uic.edu/evidence_based_medicine/index.html.

This information will assist in the triage of clinical evidence during the initial phase of writing systematic reviews and also help ensure that the update process has ready access to the latest published articles.

Acknowledgements

The authors wish to acknowledge Gary Bonifield and Prerna Das for computational and programming support of this work.

Funding

National Institutes of Health/National Library of Medicine (R01LM010817).

Conflict of interest. None declared.

Database URL: http://clingen.igib.res.in/sage.

References

U.S. National Library of Medicine. Introduction to MeSH

https://www.nlm.nih.gov/mesh/introduction.html (30 Jul 2018, is date last accessed)

Cohen

A.M.

Smalheiser

N.R.

McDonagh

M.S.

et al. (

2015

)

Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine

J. Am. Med. Inform. Assoc.

707

–

717

Wieland

L.S.

Robinson

K.A.

and

Dickersin

(

2012

)

Understanding why evidence from randomised clinical trials may not be retrieved from Medline: comparison of indexed and non-indexed records

BMJ

344

d7501

Aronson

A.R.

Mork

J.G.

Gay

C.W.

et al. (

2004

)

The NLM indexing initiative’s medical text indexer

Stud. Health. Technol. Inform.

107

707

–

717

Tsatsaronis

Balikas

Malakasiotis

et al. (

2015

)

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition

BMC Bioinformatics.

138

Shengweng

You

Wang

et al. (

2016

)

DeepMeSH: deep semantic representation for improving large-scale MeSH indexing

Bioformatics

i70

–

i79

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908368/ (12 February 2018, date last accessed)

Liu

Peng

et al. (

2015

)

MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence

Bioinformatics

i339

–

i347

Yepes

A.J.J.

Mork

J.G.

Demner-Fushman

et al. (

2013

)

Comparison and combination of several MeSH indexing approaches

AMIA Annu. Symp. Proc.

2013

709

–

718

PubMed

Elliott

J.H.

Turner

Clavisi

et al. (

2014

)

Living systematic reviews: an emerging opportunity to narrow the evidence-practice gap

PLoS Med.

e1001603

10.

Fan

R.E.

Chang

K.W.

Hsieh

C.J.

et al. (

2008

)

LIBLINEAR: a library for large linear classification

J. Mach. Learn. Res.

1871

–

1874

11.

Rüping

(

2004

)

A simple method for estimating conditional probabilities for svms

SFB 475 Komplexitätsreduktion in Multivariaten Datenstrukturen

. Technical report. Universität Dortmund.

12.

Grewal

(

2017

)

An interview with Claudio Stern

Dev. Camb. Engl.

144

4473

–

4475

13.

Frontali

(

2017

)

Laura Frontali-my life with yeast

FEMS Yeast Res.