Machine learning approach to literature mining for the genetics of complex diseases

Table 1

Metrics used to assess performance of predictive models. G^{TP} is the number of unique curator-accepted genes in true positive (TP) articles. G^{(TP + FN)} is the number of unique curator-accepted genes in true positive (TP) or false negative (FN) articles

Metric	Equation
Recall	TP/(TP + FN)
Precision	TP/(TP + FP)
Gene Recall	G^{TP}/G^{(TP + FN)}
Workload Saving	(TN + FN)/(TP + FP + TN + FN)
F-gene Score	2 (Gene Recall × Precision)/(Precision + Gene Recall)
F₁ Score	2 (Recall × Precision)/(Precision + Recall)
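The metrics in Table 1 can be computed directly from confusion-matrix counts and from the sets of curator-accepted genes. The following is a minimal sketch, not the authors' implementation: the class and function names are illustrative, and the gene symbols in the usage note are invented sample data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Article-level confusion-matrix counts."""
    tp: int
    fp: int
    tn: int
    fn: int


def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn)


def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp)


def workload_saving(c: Counts) -> float:
    # Fraction of all articles predicted negative, i.e. not sent to curators.
    return (c.tn + c.fn) / (c.tp + c.fp + c.tn + c.fn)


def gene_recall(genes_in_tp: set, genes_in_fn: set) -> float:
    # Unique accepted genes found in TP articles, divided by unique accepted
    # genes found in TP or FN articles.
    return len(genes_in_tp) / len(genes_in_tp | genes_in_fn)


def f_score(recall_like: float, prec: float) -> float:
    # Harmonic mean; pass recall for the F1 score or gene recall for the F-gene score.
    return 2 * recall_like * prec / (prec + recall_like)
```

For example, `gene_recall({"IL6", "TNF", "MMP9"}, {"TNF", "CRH"})` counts TNF only once: three of the four unique genes are recovered, even though the article mentioning CRH was missed.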

Code availability

The methods described in this study were implemented largely in the Julia programming language, including the code for building feature sets and for training and testing the classifiers. The programs are freely available for public use from Zenodo (https://doi.org/10.5281/zenodo.3376769).

Results

Class imbalance

For dbPEC, we found no significant differences (pROC, P > 0.05) in AUROC among the methods for addressing class imbalance. For dbPTB, the Neural Network classifier showed a significantly higher AUROC (pROC, P < 0.05) when class imbalance was addressed by class weights than when it was addressed by oversampling (Supplementary Tables S2 and S3). Across all classifiers trained and tested on dbPEC, the methods for dealing with class imbalance yielded similar AUCPRs (Supplementary Tables S2 and S3). The same held for all classifiers trained and tested on dbPTB. For all subsequent analyses, class imbalance was managed by assigning class weights.
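The two strategies compared above, class weights and oversampling, can be sketched as follows. The text does not give the weighting formula, so this sketch assumes the common inverse-frequency heuristic, n_samples / (n_classes × class count); the function names are illustrative.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - cnt):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels
```

Weights leave the training set unchanged and rescale each example's contribution to the loss, whereas oversampling enlarges the training set with repeated minority examples.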

Parameter optimization

Hyper-parameters for the Logistic Regression classifier were optimized using random search with 3-fold cross-validation over 1000 iterations. The optimized value for max_iter was 155 and the optimized value for C was 1.0, using ‘sag’ as the optimization algorithm. For the Random Forest classifier, the optimized values were 0.8 for max_features, 3 for min_samples_split, 0.0066423 for min_impurity_decrease, 4 for min_samples_leaf and ‘entropy’ for criterion. For the Neural Network, four hyper-parameters were optimized: alpha, learning_rate_init, hidden_layer_sizes and learning_rate, with optimized values of 0.0029, 0.0139, (60160) and ‘invscaling’, respectively. Hyper-parameters and their optimized values are shown in Supplementary Table S4.
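Random search with k-fold cross-validation draws hyper-parameter settings at random, scores each on the same folds, and keeps the setting with the best mean score. The search code itself is not shown in the text, so the following is a library-free sketch of the general technique; `score_fn` and `sample_params` are hypothetical callables supplied by the user.

```python
import random


def k_fold_indices(n_samples, k, rng):
    """Yield (train, test) index lists for k shuffled, disjoint folds."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def random_search(score_fn, sample_params, n_samples, n_iter=1000, k=3, seed=0):
    """Return the sampled setting with the highest mean cross-validated score.

    score_fn(params, train_idx, test_idx) -> float  (user-supplied, hypothetical)
    sample_params(rng) -> dict                      (user-supplied, hypothetical)
    """
    rng = random.Random(seed)
    splits = list(k_fold_indices(n_samples, k, rng))  # same folds for every candidate
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(rng)
        mean_score = sum(score_fn(params, tr, te) for tr, te in splits) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

In practice `score_fn` would fit a classifier on the training indices and return a metric on the test indices. Unlike grid search, the cost is fixed by `n_iter` rather than by the number of grid points.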

Model testing

To test the relative contribution of each feature, features were tested individually as well as in combination. The feature sets used and the size of each feature set are shown in Table 2. Groupings of feature sets were chosen based on feature size and individual feature performance. S1 contains all features except for ‘Bag-of-Words’, testing how all the features work in conjunction. S2 contains all the features in S1 except for MeSH, since MeSH is the largest feature set and the most computationally expensive to run. S3 contains all the features in S2 except Gene Count, as Gene Count was one of the worst-performing individual feature sets. S4 contains all features in S1 except for Gene-Subject-Predicate. S5 contains all features in S1 except for Gene Count. S6 contains all the features in S1 except for both Gene-Subject-Predicate and Gene Count.

Table 2

Feature set names. A ‘–’ in the Features in Set column denotes that the features in the set are the same as the set name. The size of each feature set is listed for both dbPEC and dbPTB

Set Name	Features in Set	dbPEC	dbPTB
MeSH	–	11 467	11 157
Gene Significance	–	1462	1498
Gene-Subject-Predicate	–	56	52
Semtype Count	–	390	375
Species Check	–	1177	1413
Gene Count	–	6	6
Bag-of-Words	–	34 494	31 926
S1	MeSH + Gene Significance + Semtype Count + Gene-Subject-Predicate + Species Check + Gene Count	14 558	14 501
S2	Gene Significance + Semtype Count + Gene-Subject-Predicate + Species Check + Gene Count	3091	3344
S3	Gene Significance + Semtype Count + Gene-Subject-Predicate + Species Check	3085	3338
S4	MeSH + Gene Significance + Semtype Count + Species Check + Gene Count	14 502	14 449
S5	MeSH + Gene Significance + Semtype Count + Gene-Subject-Predicate + Species Check	14 552	14 495
S6	MeSH + Gene Significance + Semtype Count + Species Check	14 496	14 443
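Each combined set in Table 2 is a union of individual feature groups, so its size should equal the sum of its components. The sketch below encodes the individual sizes from Table 2 and recomputes the sizes of S1 to S6. The dictionary and function names are illustrative, and the check assumes the groups share no features.

```python
# Sizes of the individual feature groups, as (dbPEC, dbPTB), taken from Table 2.
BASE = {
    "MeSH": (11467, 11157),
    "Gene Significance": (1462, 1498),
    "Gene-Subject-Predicate": (56, 52),
    "Semtype Count": (390, 375),
    "Species Check": (1177, 1413),
    "Gene Count": (6, 6),
}

S1 = set(BASE)  # every group except Bag-of-Words
SETS = {
    "S1": S1,
    "S2": S1 - {"MeSH"},
    "S3": S1 - {"MeSH", "Gene Count"},
    "S4": S1 - {"Gene-Subject-Predicate"},
    "S5": S1 - {"Gene Count"},
    "S6": S1 - {"Gene-Subject-Predicate", "Gene Count"},
}


def set_size(name, database):
    """Total feature count for a combined set on 'dbPEC' or 'dbPTB'."""
    col = {"dbPEC": 0, "dbPTB": 1}[database]
    return sum(BASE[group][col] for group in SETS[name])
```

For example, S3 on dbPTB sums to 1498 + 375 + 52 + 1413 = 3338.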

For the dbPTB data set, the AUROC ranged from 0.901 to 0.920 using logistic regression, from 0.826 to 0.922 using random forests and from 0.869 to 0.894 using the neural network (Table 3). S1 showed a significant increase (pROC, P < 0.005) in AUROC over ‘Bag-of-Words’ for the Random Forests classifier trained and tested on dbPTB. For the dbPEC data set, the AUROC ranged from 0.743 to 0.805 using logistic regression, from 0.761 to 0.837 using random forests and from 0.689 to 0.782 using neural networks (Table 4). On dbPEC, S1 showed no significant differences (pROC, P > 0.05) from ‘Bag-of-Words’ for any of the classifiers. For the dbPTB data set, AUCPR ranged from 0.572 to 0.646 using logistic regression, from 0.616 to 0.680 using random forests and from 0.509 to 0.643 using the neural network (Table 3). For the dbPEC data set, AUCPR ranged from 0.597 to 0.653 using logistic regression, from 0.623 to 0.678 using random forests and from 0.531 to 0.619 using neural networks (Table 4). With the random forests and neural network classifiers, on both dbPTB and dbPEC, S1 showed a greater AUCPR than the other feature sets; however, this difference is within the 95% confidence interval.
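The two summary statistics reported here can be illustrated with a small stand-alone sketch. The text does not state how AUCPR or its confidence interval was computed, so the sketch assumes average precision as the AUCPR estimator and a t-based interval over five folds. The pROC significance test for ROC curves is not reproduced.

```python
import math
import statistics


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive (one AUCPR estimator)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank
    return total / sum(labels)


T_CRIT_DF4 = 2.776  # two-sided 95% t critical value for 4 degrees of freedom


def mean_ci95_five_folds(values):
    """Mean and 95% half-width for exactly five fold scores (assumed method)."""
    if len(values) != 5:
        raise ValueError("expected five fold scores")
    half = T_CRIT_DF4 * statistics.stdev(values) / math.sqrt(5)
    return statistics.mean(values), half
```

AUROC weights both classes equally, whereas average precision depends on the proportion of positives, so the two measures can rank feature sets differently on imbalanced data.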

Table 3

AUC of the ROC curve and AUC of the precision-recall curve for all features trained and tested on dbPTB. 5-fold cross-validation was used to determine average AUCPR with a 95% confidence interval. P-values are listed for each feature set comparing ROC curves to the bag-of-words ROC curves using pROC. A ‘–’ was used to denote bag-of-words being compared to itself

Classifier	Feature Set	AUCPR	ROC AUC	ROC P-value
Logistic Regression	BOW	0.646 ± 0.031	0.906	–
	S1	0.622 ± 0.063	0.908	0.918
	S2	0.572 ± 0.078	0.919	0.572
	S3	0.572 ± 0.066	0.920	0.505
	S4	0.619 ± 0.064	0.919	0.526
	S5	0.630 ± 0.048	0.901	0.825
	S6	0.626 ± 0.045	0.907	0.974
Random Forests	BOW	0.621 ± 0.040	0.826	–
	S1	0.680 ± 0.077	0.922	0.002
	S2	0.628 ± 0.070	0.872	0.137
	S3	0.616 ± 0.069	0.862	0.251
	S4	0.678 ± 0.093	0.865	0.221
	S5	0.673 ± 0.077	0.857	0.331
	S6	0.667 ± 0.083	0.870	0.155
Neural Networks	BOW	0.631 ± 0.045	0.894	–
	S1	0.643 ± 0.022	0.893	0.979
	S2	0.530 ± 0.059	0.872	0.488
	S3	0.509 ± 0.075	0.884	0.755
	S4	0.599 ± 0.031	0.869	0.393
	S5	0.580 ± 0.048	0.891	0.892
	S6	0.607 ± 0.098	0.892	0.923

Table 4

AUC of the ROC curve and AUC of the precision-recall curve for all features trained and tested on dbPEC. 5-fold cross-validation was used to determine average AUCPR with a 95% confidence interval. P-values are listed for each feature set comparing ROC curves to the bag-of-words ROC curves using pROC. A ‘–’ was used to denote bag-of-words being compared to itself

Classifier	Feature Set	AUCPR	ROC AUC	ROC P-value
Logistic Regression	BOW	0.653 ± 0.043	0.802	–
	S1	0.652 ± 0.028	0.805	0.875
	S2	0.600 ± 0.027	0.746	0.007
	S3	0.597 ± 0.027	0.743	0.005
	S4	0.642 ± 0.027	0.805	0.854
	S5	0.639 ± 0.028	0.801	0.991
	S6	0.633 ± 0.066	0.802	0.985
Random Forests	BOW	0.651 ± 0.041	0.837	–
	S1	0.678 ± 0.023	0.805	0.064
	S2	0.636 ± 0.033	0.777	0.001
	S3	0.623 ± 0.043	0.761	0
	S4	0.673 ± 0.031	0.802	0.043
	S5	0.664 ± 0.038	0.806	0.065
	S6	0.641 ± 0.090	0.804	0.053
Neural Networks	BOW	0.613 ± 0.027	0.763	–
	S1	0.619 ± 0.052	0.782	0.372
	S2	0.531 ± 0.037	0.689	0.007
	S3	0.554 ± 0.038	0.691	0.009
	S4	0.595 ± 0.051	0.764	0.946
	S5	0.598 ± 0.048	0.771	0.696
	S6	0.610 ± 0.003	0.743	0.417

Model evaluation

The performance of the models using the three classifiers and various feature sets is shown in Tables 5 and 6. For dbPTB, workload savings at a 95% gene recall threshold ranged from 0.797 with Random Forests to 0.814 with Neural Networks, compared to the actual manual rejection rate of 0.846. For dbPEC, workload savings at a 95% gene recall threshold ranged from 0.283 with Neural Networks to 0.371 with Random Forests, compared to the actual manual rejection rate of 0.492. In addition, all feature sets except S3 and S4 outperformed ‘Bag-of-Words’ in terms of workload savings on every classifier for both dbPTB and dbPEC. Notably, S1, S5 and S6 showed the best performance at a 95% gene recall. In addition, S1, S5 and S6 outperformed ‘Bag-of-Words’ on every classifier in terms of both F₁ score and F-gene score, for both dbPEC and dbPTB.
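Reporting metrics "at a 95% gene recall threshold" implies choosing the decision threshold that keeps as few articles as possible while still recovering at least 95% of the unique accepted genes. One way to do this is sketched below, under assumptions: the function name and input layout are illustrative, tied scores are not treated specially, and the gene symbols in the usage note are invented sample data.

```python
def threshold_for_gene_recall(scores, accepted_genes, target=0.95):
    """Return (threshold, workload_saving) at the first point gene recall >= target.

    scores[i]         : predicted relevance of article i
    accepted_genes[i] : set of curator-accepted genes in article i
                        (empty set for articles rejected by curators)
    """
    all_genes = set().union(*accepted_genes)
    if not all_genes:
        raise ValueError("no accepted genes in the reference set")
    # Walk down the ranking, keeping articles until enough unique genes are found.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    found = set()
    for kept, i in enumerate(order, start=1):
        found |= accepted_genes[i]
        if len(found) / len(all_genes) >= target:
            # Articles below the threshold are predicted negative (TN + FN).
            return scores[i], 1 - kept / len(scores)
    raise ValueError("target gene recall cannot be reached")
```

With scores `[0.95, 0.90, 0.80, 0.40, 0.30, 0.20, 0.10, 0.05]` and gene sets `[{"IL6"}, {"TNF", "IL6"}, set(), {"MMP9"}, set(), set(), {"IL6"}, set()]`, the top four articles already contain all three unique genes. The low-scoring article that mentions only IL6 becomes a false negative without reducing gene recall, which is why gene recall can exceed article recall.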

Table 5

The values of the performance metrics for each feature set trained and tested on dbPTB. Performance metrics were recorded for each classifier at a 95% gene recall threshold

Classifier	Feature Set	Recall	Gene Recall	Precision	F₁ Score	F-gene Score	Workload Savings
Logistic Regression	BOW	0.805	0.956	0.379	0.516	0.543	0.716
	S1	0.829	0.956	0.531	0.648	0.683	0.791
	S2	0.756	0.956	0.544	0.633	0.693	0.814
	S3	0.829	0.956	0.523	0.642	0.676	0.788
	S4	0.756	0.956	0.534	0.626	0.686	0.810
	S5	0.805	0.956	0.541	0.647	0.691	0.801
	S6	0.854	0.956	0.556	0.673	0.703	0.794
Random Forests	BOW	0.829	0.965	0.262	0.398	0.412	0.575
	S1	0.878	0.956	0.379	0.529	0.543	0.690
	S2	0.707	0.956	0.460	0.558	0.621	0.794
	S3	0.683	0.956	0.406	0.509	0.570	0.775
	S4	0.683	0.956	0.452	0.544	0.613	0.797
	S5	0.659	0.956	0.435	0.524	0.598	0.797
	S6	0.659	0.956	0.429	0.519	0.592	0.794
Neural Networks	BOW	0.829	0.956	0.374	0.515	0.537	0.703
	S1	0.707	0.956	0.617	0.659	0.750	0.846
	S2	0.805	0.956	0.429	0.559	0.592	0.748
	S3	0.756	0.956	0.443	0.559	0.605	0.771
	S4	0.756	0.956	0.508	0.608	0.664	0.801
	S5	0.756	0.956	0.544	0.633	0.693	0.814
	S6	0.756	0.956	0.554	0.639	0.701	0.817

Table 6

The values of the performance metrics for each feature set trained and tested on dbPEC. Performance metrics were recorded for each classifier at a 95% gene recall threshold

Classifier	Feature Set	Recall	Gene Recall	Precision	F₁ Score	F-gene Score	Workload Savings
Logistic Regression	BOW	0.950	0.951	0.425	0.588	0.588	0.247
	S1	0.967	0.956	0.439	0.604	0.602	0.258
	S2	0.939	0.951	0.421	0.582	0.584	0.249
	S3	0.944	0.966	0.413	0.574	0.578	0.228
	S4	0.967	0.956	0.444	0.608	0.606	0.266
	S5	0.961	0.951	0.458	0.620	0.618	0.292
	S6	0.956	0.951	0.461	0.622	0.621	0.301
Random Forests	BOW	0.939	0.956	0.448	0.607	0.610	0.294
	S1	0.928	0.951	0.488	0.640	0.645	0.360
	S2	0.922	0.956	0.445	0.600	0.607	0.301
	S3	0.917	0.951	0.426	0.582	0.589	0.275
	S4	0.950	0.951	0.491	0.648	0.648	0.348
	S5	0.922	0.961	0.494	0.643	0.653	0.371
	S6	0.922	0.956	0.484	0.635	0.643	0.358
Neural Networks	BOW	0.967	0.956	0.407	0.572	0.570	0.199
	S1	0.961	0.966	0.410	0.575	0.576	0.210
	S2	0.922	0.951	0.400	0.558	0.563	0.223
	S3	0.944	0.951	0.369	0.530	0.531	0.137
	S4	0.967	0.966	0.371	0.536	0.536	0.122
	S5	0.961	0.956	0.448	0.611	0.610	0.277
	S6	0.922	0.956	0.433	0.590	0.596	0.283

Discussion

This study explored the potential of using machine learning approaches to identify scientific articles with genes or genetic information relevant to complex diseases. We used logistic regression, random forests, and neural networks to classify articles relevant to the diseases of interest that should be considered for further formal analysis. Random search cross-validation was used to optimize the hyper-parameters of the various classifiers. This method was used instead of grid search cross-validation because it is less computationally demanding (31). Our previously published, publicly accessible, curated databases for two complex diseases, preterm birth and preeclampsia, served as our reference data sets. To test the models, articles were separated into a training set and a test set. Given the complexity of this classification task, 80% of the articles were selected for the training set to ensure that we had a sufficient number of training examples to develop a reliable predictive model. This 80% training set approach is consistent with the training set size used in an example evaluation of scikit-learn’s liblinear logistic regression (36). Class imbalances can adversely influence classifier performance due to predictive bias in favor of the majority class (17). Since the majority class is more heavily represented in the data set, it tends to have more influence on cases of uncertainty, which can lead to over-prediction of majority cases (17). Using pROC, ROC curves for the various methods for dealing with class imbalances were compared. No significant differences (pROC, P > 0.05) were found between the methods in dbPEC. In dbPTB, class weights significantly increased AUROC for the neural network classifier. For both dbPTB and dbPEC, AUCPR was similar across all classifiers for all methods of dealing with class imbalances. As such, class imbalances were managed by assigning class weights.

We compared the performance of each of the machine learning classifiers trained with combinations of different feature sets. The purpose of our approach was primarily to reduce the total number of articles that we needed to review manually in order to identify the genes associated with a condition of interest. This meant that prioritizing a predictive model that classifies articles with high recall was important. However, recall alone is not entirely adequate for evaluating the ability of a classifier to help curators identify the genes associated with pathogenesis. Instead, we defined two new measures (gene recall and F-gene score) for evaluation of the classifiers. In addition, we defined novel feature sets to identify species, gene mentions, gene-subject interactions, and gene-quantitative concept co-occurrences in the titles and abstracts. These feature sets were tested independently and in various combinations. The combined feature sets S1, S2, S5 and S6 showed greater workload savings across all classifiers when compared to ‘Bag-of-Words’. In addition, these feature sets were much smaller than ‘Bag-of-Words’, making them less computationally expensive and better suited to curating large quantities of articles. Our results suggest that machine learning algorithms can identify articles of interest for creation or maintenance of a database or gene set for complex diseases. Given the scale of the manual classification of articles reviewed for the reference databases, we conclude that our pipeline performed well in its ability both to prioritize articles with relevant genetic information and to decrease curator workload. Taken together, our analysis shows that automation can help curators review literature for genetic markers of human disease more efficiently while still maintaining accuracy comparable to strict manual curation.

In the developed pipeline, we used several text-mining tools to annotate and extract information from the titles and abstracts of articles. The rich feature set gathered from annotated titles and abstracts using these tools allowed us to develop a predictive model that met our standards for gene recall and provided reasonable workload savings. The workload savings, while markedly different between our analyses of the preterm birth and preeclampsia data sets, were reasonably close to the percentage of papers that were ‘not considered’ by manual curation. These savings are sufficiently large to justify use of our pipeline in future curation efforts and maintenance of these databases. Furthermore, with a gene recall of 95%, our model captures most relevant genes. Genes not captured are likely to be identified by other means, such as the screening of publicly available databases for genetic data or pathway-based gene imputation (37).

When assessing the performance of our pipeline, we acknowledge the abundance of similar tools that use machine learning to classify articles as relevant or irrelevant for curation. Many such tools have been developed to simplify systematic review. Notable examples include Abstrackr (13) and Rayyan (12). These tools and others have been shown to classify articles for inclusion in systematic reviews with recall that outperforms our models (11, 22). For our data sets, at a 95% gene recall threshold, Abstrackr yielded a greater recall but lower workload savings and precision, recovering more articles with less genetic relevance [data not shown]. Although these tools may be useful for automating triage for many systematic reviews, for a more nuanced curation task, such as identifying articles to maintain a phenotype-specific genetic database, it may be helpful to use a more specialized machine learning approach such as our own.

Beyond what has already been described, a significant advantage of our pipeline is that it allows for granular control over classification, with the added benefit of generating a MySQL database that stores article information relevant to curation teams. This includes descriptive metadata and the annotations previously described. Having these data easily accessible enables curation teams to further characterize the set of ‘considered’ articles and to gather information useful for future association testing. With our pipeline, tasks such as the identification of all genes and mutations mentioned in the titles and abstracts can be performed in a single query. Given that our pipeline is specifically designed to identify articles with relevant genetic information, it may be the preferred approach for those curating literature for the genetic study of complex diseases.
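The kind of single-query lookup described above might look like the following sketch. The pipeline's actual schema is not given in the text, so every table and column name here is hypothetical, the rows are invented sample data, and an in-memory SQLite database stands in for MySQL so that the example runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per article, one row per annotation.
conn.executescript("""
CREATE TABLE article (pmid INTEGER PRIMARY KEY, title TEXT, considered INTEGER);
CREATE TABLE annotation (pmid INTEGER, kind TEXT, mention TEXT);
""")
conn.executemany(
    "INSERT INTO article VALUES (?, ?, ?)",
    [(1, "Sample article A", 1), (2, "Sample article B", 0), (3, "Sample article C", 1)],
)
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [
        (1, "gene", "IL6"),
        (1, "mutation", "rs0000001"),  # placeholder identifier
        (1, "species", "human"),
        (2, "gene", "TNF"),
        (3, "gene", "TNF"),
    ],
)

# One query: every gene or mutation mention in the 'considered' articles.
rows = conn.execute("""
SELECT a.pmid, n.kind, n.mention
FROM article AS a
JOIN annotation AS n ON n.pmid = a.pmid
WHERE a.considered = 1 AND n.kind IN ('gene', 'mutation')
ORDER BY a.pmid, n.kind, n.mention
""").fetchall()
```

Keeping annotations in their own table means new annotation types can be added without changing the article table.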

A possible limitation of our approach is that it is unclear how well our pipeline will perform on other data sets. Our separate evaluations of dbPTB and dbPEC reveal different values for recall, specificity, gene recall and workload reduction. This variance is likely multifactorial in origin. The accuracy of our predictive model will be dependent on how articles were originally selected for consideration, how robust the collection of literature is on a given condition of interest, the number of genes that have been shown to contribute to the condition, and variations in how the condition is characterized in biomedical literature. Using this pipeline to curate articles relevant for the genetic study of other conditions will be necessary for further evaluation.

Additional features and classifiers were evaluated but not used in our final pipeline. The other classifiers considered were a support vector machine and a second neural network. The support vector machine was omitted due to inferior performance as well as known but minor inaccuracies in its probabilistic output (4). A second multilayer perceptron was developed using Mocha.jl (https://github.com/pluskid/Mocha.jl) but was likewise omitted due to poor performance. Additional features that were considered included the journal ISSN and publication year. The journal ISSN associated with each article was not used because it did not improve prediction accuracy. Publication year was not used because it would have introduced bias from the training set: for 2014 and 2015, the preeclampsia data set included only articles that were rejected during manual curation, as accepted papers published in those years had not yet been added to the preeclampsia database at the time of data access.

Conclusions

We have developed a machine learning-based computational pipeline that can identify articles that meet criteria for formal curation. This approach allows for a significant reduction in curation workload for those seeking a comprehensive collection of literature that documents the genes related to a phenotype of interest. This approach may prove to be generalizable to other phenotypes or diseases of interest with a robust base of publications. Furthermore, comparative evaluation of the machine learning models demonstrated that the combined feature sets S1, S2, S5 and S6 performed better in terms of workload savings than bag-of-words. In addition, S1, S5 and S6 were shown to outperform bag-of-words in F₁ score and F-gene score. Moreover, our feature sets are less than half the size of bag-of-words and as such are less computationally expensive. This is particularly notable when curating large quantities of articles. Use of these predictive models can potentially improve the efficiency of future curation efforts.

Funding

The United States National Institutes of Health (P20GM109035, P20GM121298, P30GM114750 and U54GM115677).

Conflict of interest: None declared.

References

1. Uzun, Laliberte, Parker et al. (2012) dbPTB: a database for preterm birth. Database (Oxford), bar069. Epub 2012/02/08. doi: 10.1093/database/bar069. PubMed PMID: 22323062; PubMed Central PMCID: PMC3275764.

2. Uzun, Triche E.W., Schuster et al. (2016) dbPEC: a comprehensive literature-based database for preeclampsia related genes and phenotypes. Database (Oxford). Epub 2016/03/05. doi: 10.1093/database/baw006. PubMed PMID: 26946289; PubMed Central PMCID: PMC4779341.

3. Bianco A.M., Marcuzzi, Zanin et al. (2013) Database tools in genetic diseases research. Genomics, 101. Epub 2012/11/10. PubMed PMID: 23147677.

4. Wu T.-F., Lin C.-J. and Weng R.C. (2004) Probability estimates for multi-class classification by pairwise coupling. J. Machine Learn. Res., 975–1005.

5. Winnenburg, Wächter, Plake et al. (2008) Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief. Bioinform., 466–478. Epub 2008/12/06. doi: 10.1093/bib/bbn043. PubMed PMID: 19060303.

6. Baumgartner W.A., Cohen K.B., Fox L.M. et al. (2007) Manual curation is not sufficient for annotation of genomic databases. Bioinformatics, i41–i48. doi: 10.1093/bioinformatics/btm229. PubMed PMID: 17646325; PubMed Central PMCID: PMC2516305.

7. Brookes A.J. and Robinson P.N. (2015) Human genotype-phenotype databases: aims, challenges and opportunities. Nat. Rev. Genet., 702–715. Epub 2015/11/10. doi: 10.1038/nrg3932. PubMed PMID: 26553330.

8. Bastian, Glasziou and Chalmers (2010) Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med., e1000326. Epub 2010/09/30. doi: 10.1371/journal.pmed.1000326. PubMed PMID: 20877712; PubMed Central PMCID: PMC2943439.

9. Crequit, Trinquart, Yavchitz and Ravaud (2016) Wasted research when systematic reviews fail to provide a complete and up-to-date evidence synthesis: the example of lung cancer. BMC Med. Epub 2016/01/23. doi: 10.1186/s12916-016-0555-0. PubMed PMID: 26792360; PubMed Central PMCID: PMC4719540.

10. O’Mara-Eves, Thomas, McNaught et al. (2015) Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst. Rev. Epub 2015/01/14. doi: 10.1186/2046-4053-4-5. PubMed PMID: 25588314; PubMed Central PMCID: PMC4320539.

11. Bannach-Brown, Przybyła, Thomas et al. (2018) The use of text-mining and machine learning algorithms in systematic reviews: reducing workload in preclinical biomedical sciences and reducing human screening error. bioRxiv 255760. doi: 10.1101/255760.

12. Ouzzani, Hammady, Fedorowicz et al. (2016) Rayyan-a web and mobile app for systematic reviews. Syst. Rev., 210. Epub 2016/12/05. doi: 10.1186/s13643-016-0384-4. PubMed PMID: 27919275; PubMed Central PMCID: PMC5139140.

13. Wallace B.C., Small, Brodley C.E. et al. (2012) Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. In: Proc. of the ACM International Health Informatics Symposium (IHI), pp. 819–824.

14. Hirschman, Burns G.A., Krallinger et al. (2012) Text mining for the biocuration workflow. Database (Oxford), bas020. Epub 2012/04/20. doi: 10.1093/database/bas020. PubMed PMID: 22513129; PubMed Central PMCID: PMC3328793.

15. Thomas, Noel-Storr, Marshall et al. (2017) Living systematic reviews: 2. Combining human and machine effort. J. Clin. Epidemiol. Epub 2017/09/16. doi: 10.1016/j.jclinepi.2017.08.011. PubMed PMID: 28912003.

16. Marshall (2017) http://systematicreviewtools.com/index.php [cited 2019].

17. Almeida, Meurs M.J., Kosseim et al. (2014) Machine learning for biomedical literature triage. PLoS One, e115892. Epub 2014/12/31. doi: 10.1371/journal.pone.0115892. PubMed PMID: 25551575; PubMed Central PMCID: PMC4281078.

18. Howe, Costanzo, Fey et al. (2008) Big data: the future of biocuration. Nature, 455. Epub 2008/09/05. doi: 10.1038/455047a. PubMed PMID: 18769432; PubMed Central PMCID: PMC2819144.

19. Muller H.M., Kenny E.E. and Sternberg P.W. (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol., e309. Epub 2004/09/24. doi: 10.1371/journal.pbio.0020309. PubMed PMID: 15383839; PubMed Central PMCID: PMC517822.

20. Gates, Johnson and Hartling (2018) Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of the Abstrackr machine learning tool. Syst. Rev. Epub 2018/03/14. doi: 10.1186/s13643-018-0707-8. PubMed PMID: 29530097; PubMed Central PMCID: PMC5848519.

21. Van Auken, Fey, Berardini T.Z. et al. (2012) Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database (Oxford), bas040. Epub 2012/11/20. doi: 10.1093/database/bas040. PubMed PMID: 23160413; PubMed Central PMCID: PMC3500519.

22. Rathbone, Hoffmann and Glasziou (2015) Faster title and abstract screening? Evaluating Abstrackr, a semi-automated online screening program for systematic reviewers. Syst. Rev.

23. Cox (1958) The regression analysis of binary sequences. J. R. Stat. Soc. B. Methodol., 215–242.

24. Breiman L. (2001) Random Forests. Machine Learn.

25. McCulloch and Pitts (1943) A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 115–133.

26. Hur, Schuyler A.D., States D.J. et al. (2009) SciMiner: web-based literature mining tool for target identification and functional enrichment analysis. Bioinformatics, 838–840. Epub 2009/02/04. doi: 10.1093/bioinformatics/btp049. PubMed PMID: 19188191; PubMed Central PMCID: PMC2654801.

27. Wei C.H., Kao H.Y. and Lu Z. (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res., Web Server issue, W518–W522. Epub 2013/05/22. doi: 10.1093/nar/gkt441. PubMed PMID: 23703206; PubMed Central PMCID: PMC3692066.

28. Aronson A.R. (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. PubMed PMID: 11825149; PubMed Central PMCID: PMC2243666.

29. Rindflesch T.C. and Fiszman (2003) The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J. Biomed. Inform., 462–477. doi: 10.1016/j.jbi.2003.11.003. PubMed PMID: 14759819.

30. Pedregosa, Varoquaux, Gramfort et al. (2011) Scikit-learn: machine learning in Python. J. Machine Learn. Res., 2825–2830. Epub 2/1/2011.

31. (2012) Random search for hyper-parameter optimization. J. Machine Learn. Res., 281–305.

32. Probst, Wright and Boulesteix (2019) Hyperparameters and tuning strategies for Random Forest. Wires Data Mining Knowl. Discov., e1301.

33. Snoek, Adams R.A. and Larochelle (2012) Practical Bayesian optimization of machine learning algorithms. Proceeding NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, pp. 2951–2959.

34. Robin, Turck, Hainard et al. (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. Epub 2011/03/19. doi: 10.1186/1471-2105-12-77. PubMed PMID: 21414208; PubMed Central PMCID: PMC3068975.

35. Grau, Grosse and Keilwagen (2015) PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics, 2595–2597. Epub 2015/03/27. doi: 10.1093/bioinformatics/btv153. PubMed PMID: 25810428; PubMed Central PMCID: PMC4514923.

36. Fan R.-E., Chang K.-W., Hsieh C.-J. et al. (2008) LIBLINEAR: a library for large linear classification. J. Machine Learn. Res., 1871–1874.