CARD*Shark: automated prioritization of literature curation for the Comprehensive Antibiotic Resistance Database

Table 1.

CARD*Shark 2 and supervised learning cross-validation precision-recall statistics

Model name	Feature extraction method	Precision	Recall	F₁
Logistic regression	TF–IDF bi/trigram	0.96	0.41	0.58
	TF–IDF word	0.92	0.81	0.86
	BOW	0.89	0.88	0.89
Naive Bayes	TF–IDF bi/trigram	1.00	0.26	0.41
	TF–IDF word	0.99	0.36	0.52
	BOW	0.74	0.96	0.83
Random forest	TF–IDF bi/trigram	0.83	0.60	0.70
	TF–IDF word	0.89	0.66	0.76
	BOW	0.90	0.66	0.76
Extreme gradient boosting	TF–IDF bi/trigram	0.84	0.65	0.74
	TF–IDF word	0.89	0.81	0.85
	BOW	0.89	0.81	0.85
Support vector machine	TF–IDF bi/trigram	-	0.00	0.00
	TF–IDF word	-	0.00	0.00
	BOW	0.96	0.15	0.26
CARD*Shark 2		0.38	0.97	0.54

Results from 5-fold cross-validation of several supervised learning models using three feature extraction methods, plus CARD*Shark 2. Underlined numbers represent the best-performing method for each category. For CARD*Shark 2, high-level predictions are considered positive predictions, and low-level predictions are considered negative predictions.

Table 2.

Model validation results and predictions for November and September 2019 papers

Model name	Feature extraction method	FN	FP	TN	TP	Precision	Recall	F₁	Negative predictions	Positive predictions
Logistic regression	TF–IDF bi/trigram	7	6	397	1	0.14	0.13	0.13	241 618	187
	TF–IDF word	4	42	361	4	0.09	0.50	0.15	240 773	1032
	BOW	4	96	307	4	0.04	0.50	0.07	237 149	4656
Naive Bayes	TF–IDF bi/trigram	8	0	403	0		0.00	0.00	241 796	9
	TF–IDF word	8	0	403	0		0.00	0.00	241 788	17
	BOW	1	92	311	7	0.07	0.88	0.13	237 126	4679
Random forest	TF–IDF bi/trigram	3	33	370	5	0.13	0.63	0.22	240 570	1235
	TF–IDF word	7	30	373	1	0.03	0.13	0.05	240 163	1642
	BOW	5	38	365	3	0.07	0.38	0.12	240 023	1782
Extreme gradient boosting	TF–IDF bi/trigram	5	28	375	3	0.10	0.38	0.15	240 882	923
	TF–IDF word	4	31	372	4	0.11	0.50	0.19	240 165	1640
	BOW	6	27	376	2	0.07	0.25	0.11	240 290	1515
Support vector machine	TF–IDF bi/trigram	8	0	403	0		0.00	0.00	241 805	0
	TF–IDF word	8	0	403	0		0.00	0.00	241 805	0
	BOW	7	2	401	1	0.33	0.13	0.18	241 742	63
CARD*Shark 2		0	165	238	8	0.05	1	0.09	200 906	40 899

TP, TN, FP, FN, precision, recall and F₁ values represent validation of a random subset of papers by human curators, with underlined numbers representing the best-performing method for precision, recall and F₁. Negative predictions represent the number of papers that CARD curators would ignore, while positive predictions are the number of papers requiring review by CARD curators for possible new additions to CARD. For CARD*Shark 2, high-level predictions are considered positive predictions, and low-level predictions are considered negative predictions. TP, true positive; TN, true negative; FP, false positive; FN, false negative.
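
The precision, recall and F₁ columns in Table 2 follow directly from the TP, FP and FN counts. The sketch below recomputes them for a few rows as a consistency check. The function names are illustrative only and are not taken from the CARD*Shark software, and the half-up rounding is an assumption about how the table values were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    Precision is None when the model made no positive predictions,
    which corresponds to the blank precision cells in Table 2.
    """
    predicted_positive = tp + fp
    actual_positive = tp + fn
    precision = tp / predicted_positive if predicted_positive else None
    recall = tp / actual_positive if actual_positive else None
    if not precision or not recall:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def round_half_up(value, places="0.01"):
    """Round to two decimals with ties going up (assumed table convention)."""
    return Decimal(str(value)).quantize(Decimal(places), rounding=ROUND_HALF_UP)


# (TP, FP, FN) counts copied from Table 2.
rows = {
    "Naive Bayes, BOW": (7, 92, 1),
    "Random forest, TF-IDF bi/trigram": (5, 33, 3),
    "CARD*Shark 2": (8, 165, 0),
    "Support vector machine, TF-IDF word": (0, 0, 8),
}

for name, (tp, fp, fn) in rows.items():
    precision, recall, f1 = precision_recall_f1(tp, fp, fn)
    shown = ["undefined" if v is None else str(round_half_up(v))
             for v in (precision, recall, f1)]
    print(f"{name}: precision={shown[0]}, recall={shown[1]}, F1={shown[2]}")
```

With only eight true positives in the validated subset, a single paper moves recall by 0.125, which is one reason the differences between models should be read cautiously.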

To gain a better understanding of each model’s performance in a real-world application, each model made predictions on all papers for September and November 2019 (Table 2). A random subset of these papers was used for independent human validation. The main goal of these models was to achieve high recall, as missing a novel ARG because of a poorly performing model is detrimental to CARD’s overall objectives. The secondary objective was high precision, to reduce the number of papers curators must review. Based on human validation, all the supervised learning models had low precision (<34%), while naive Bayes on BOW obtained the highest recall among them, 88% (Table 2). Despite the high recall of naive Bayes, low precision is undesirable as it would result in too many papers for manual review. If we instead focus on models with a good balance between precision and recall via F₁ values, we find logistic regression, naive Bayes, random forest and extreme gradient boosting each have one model that performs with the highest F₁ score, with random forest on TF–IDF bi/trigrams having the best overall performance (Table 2). CARD*Shark 2 achieves the best recall (100%) at the cost of having the third lowest precision. The impact of CARD*Shark 2’s low precision can be seen in the 40 899 papers it flags for curator review, nearly an order of magnitude more than any of the supervised learning methods. Future improvements to precision and recall may require the use of an ensemble of the best-performing models.

Notably, only 430 papers were selected for human validation out of a set of >200 000 papers, and as such, we cannot definitively conclude that one model is better than another until more papers are evaluated. A more significant issue faced by both CARD*Shark 2 and the supervised learning models is that we do not know how many relevant papers are being ignored, as CARD does not keep a record of negative curations. Moving forward, it would be advisable to mix a subsample of negative predictions into the curation set to estimate how many essential papers are being ignored.
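
One way to mix negative predictions into the curation set is sketched below. This is a minimal illustration under stated assumptions, not the procedure used by CARD: the function names, the use of PubMed identifiers as plain integers and the fixed audit size are all hypothetical.

```python
import random


def build_curation_batch(positive_pmids, negative_pmids, audit_size, seed=None):
    """Combine all positive predictions with a random audit sample of negatives.

    Returns (batch, audit_set). The batch is shuffled so curators cannot tell
    which papers came from the negative pool.
    """
    rng = random.Random(seed)
    pool = sorted(negative_pmids)  # sorted for reproducibility with a seed
    audit = rng.sample(pool, min(audit_size, len(pool)))
    batch = list(positive_pmids) + audit
    rng.shuffle(batch)
    return batch, set(audit)


def estimate_missed_rate(audit_set, relevant_pmids):
    """Fraction of audited negatives that curators judged relevant."""
    if not audit_set:
        return None
    return len(audit_set & set(relevant_pmids)) / len(audit_set)


# Hypothetical identifiers, for illustration only.
positives = [101, 102, 103]
negatives = range(1000, 1100)
batch, audit = build_curation_batch(positives, negatives, audit_size=10, seed=0)
print(len(batch), "papers to review, of which", len(audit), "are audited negatives")
```

Because relevant papers are rare, a small audit sample will usually contain none, so the estimated missed rate only becomes informative once audits accumulate over many months.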

Continued evaluation of logistic regression, random forest and naive Bayes is being performed through monthly paper predictions that are assessed by CARD’s team of curators. Additionally, a retrospective analysis of each of the models was conducted by predicting papers for the majority of months CARD*Shark 2 has been running (1 July 2017–1 November 2020). During this time, CARD*Shark 2 flagged 22 196 unique papers, 66 of which were added to CARD by the curators (Table 3). The benefit of the expanded scope of the supervised learning models can be seen in Figure 2, where 44 papers were successfully identified by the models but never flagged by CARD*Shark 2. At the same time, CARD*Shark 2 was able to identify 16 papers the supervised learning models missed (Figure 2). These results indicate that a combination of CARD*Shark 2 and the supervised learning models may be necessary to identify papers for curation into CARD.
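
The overlap described above amounts to set arithmetic over paper identifiers. The sketch below shows the idea on toy data; the function name and the identifiers are hypothetical, and the real counts come from the retrospective analysis, not from this example.

```python
def compare_flagged(curated, ml_flagged, cardshark_flagged):
    """Partition curated papers by which approach flagged them."""
    curated = set(curated)
    ml_hits = curated & set(ml_flagged)
    cs_hits = curated & set(cardshark_flagged)
    return {
        "both": ml_hits & cs_hits,
        "ml_only": ml_hits - cs_hits,          # found only by supervised models
        "cardshark_only": cs_hits - ml_hits,   # found only by CARD*Shark 2
        "missed_by_all": curated - ml_hits - cs_hits,
    }


# Toy identifiers, for illustration only.
curated_into_card = {1, 2, 3, 4, 5}
flagged_by_ml = {1, 2, 3, 10}       # union of the supervised models' positives
flagged_by_cardshark = {3, 4, 11}   # high- and low-level predictions together

regions = compare_flagged(curated_into_card, flagged_by_ml, flagged_by_cardshark)
for name, papers in regions.items():
    print(name, sorted(papers))
```

Papers in the "missed_by_all" region can only be detected if curators find them through some route other than the models, which is why they do not appear in a comparison restricted to flagged papers.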

Table 3.

Retrospective predictions against papers added to PubMed between July 2017 and December 2020

Model name	Papers examined	Positive predictions	Added to CARD
Logistic regression	3 955 928	30 049	75
Naive Bayes		75 843	93
Random forest		17 318	69
CARD*Shark 2	22 196	H: 10 676; L: 11 520	H: 58; L: 8

CARD*Shark 2 categorizes its predictions into an L or H level. L, low; H, high.

Figure 2.

A Venn diagram illustrating the overlap of each model’s positive paper predictions that were ultimately curated into CARD. The plot is based on data from Table 3. For CARD*Shark 2, both high- and low-level predictions are included.

Conclusion

Overall, we have found that applying supervised learning to rapidly triage thousands of publications can viably reduce the burden associated with curating data. However, due to the limited scope associated with CARD’s curation goal (i.e. identifying new ARGs only), models perform with poor precision but high recall. To compensate for this low precision, a combination of CARD*Shark 2 and the supervised learning models will be incorporated into CARD by ranking publications based on model agreement to maintain high recall while prioritizing high-value publications (i.e. publications with the highest model agreement are reviewed first). As such, a computer-guided curation paradigm that centers ultimately on expert, human biocuration allows CARD to provide comprehensive, high-value, trustworthy data for genomic surveillance of AMR.
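
Ranking by model agreement could be as simple as counting how many models flag each paper. The sketch below assumes an unweighted vote in which every model counts equally; the text does not specify the scoring, and the function name, model keys and identifiers are hypothetical.

```python
from collections import Counter


def rank_by_agreement(predictions):
    """Rank papers by the number of models that flagged them.

    predictions maps a model name to an iterable of flagged paper identifiers.
    Returns (paper, votes) pairs, highest agreement first; ties are broken by
    identifier so the order is deterministic.
    """
    votes = Counter()
    for flagged in predictions.values():
        votes.update(set(flagged))  # each model votes at most once per paper
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Toy predictions, for illustration only.
monthly_predictions = {
    "logistic_regression": [11, 12, 13],
    "naive_bayes": [12, 13, 14],
    "random_forest": [13],
    "cardshark2": [13, 12, 15],
}

for paper, count in rank_by_agreement(monthly_predictions):
    print(f"paper {paper}: flagged by {count} model(s)")
```

Every flagged paper stays in the queue, so recall is unchanged; only the review order changes. Weighting CARD*Shark 2 high-level predictions above low-level ones would be a natural variation, but that is not something this sketch attempts.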

Data availability

Software for CARD*Shark is available at https://github.com/edalatma/card_shark_3.

Funding

Canadian Institutes of Health Research (PJT-156214 to A.G.M.); Cisco Systems Canada, Inc. (a Cisco Research Chair in Bioinformatics to A.G.M.); an Ontario Graduate Scholarship and a McMaster University Ashbaugh Graduate Scholarship (to A.E.). Computer resources were supplied by the McMaster Service Lab and Repository computing cluster, funded in part by grants from the Canada Foundation for Innovation (34531 to A.G.M.).

Conflict of interest statement

The authors declare no competing interests.

Acknowledgements

We thank Brian P. Alcock and Amogelang R. Raphenya of the CARD for assistance with all aspects of this research. We would like to thank Brian P. Alcock, Amogelang R. Raphenya, Kara Tsang, Jalees Nasir, Martins Oloni, William Huynh, Sohaib Syed, Rachel Tran and Marcel Jansen for participation in the validation of CARD*Shark predictions. Lastly, we thank Sachin Doshi and Arjun Sharma for pioneering literature triage methods for CARD.
