Assessing the use of supplementary materials to improve genomic variant discovery

Table 1.

Impact of the supplementary materials to retrieve documents for variants returning silence based on MEDLINE and PubMed Central

	BRCA	ClinVar	Total
Number of variants for which no documents were retrieved in MEDLINE and PubMed Central (baseline)	136	771	907
Number of variants from the baseline for which at least one document was retrieved in the SD	130	441	571
Average number of documents retrieved in the SD for these variants	4.82 (min: 1; max: 18)	3.12 (min: 1; max: 29)	3.51 (min: 1; max: 29)
Relative reduction of the silence when using the SD	−95.56%	−57.20%	−62.95%
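The last row of Table 1 can be recomputed from the two count rows. The sketch below assumes that the relative reduction of silence is the share of baseline-silent variants for which the SD returned at least one document, expressed as a negative percentage; the function and variable names are illustrative and not taken from the authors' code.

```python
# Counts copied from Table 1.
# baseline_silent: variants with no document in MEDLINE and PubMed Central.
# rescued_by_sd: variants from that baseline with at least one document in the SD.
baseline_silent = {"BRCA": 136, "ClinVar": 771, "Total": 907}
rescued_by_sd = {"BRCA": 130, "ClinVar": 441, "Total": 571}


def silence_reduction(silent: int, rescued: int) -> float:
    """Relative change in silence, in percent (negative means less silence).

    Assumption: reduction = -(rescued / silent) * 100.
    """
    return -100.0 * rescued / silent


reductions = {
    name: silence_reduction(baseline_silent[name], rescued_by_sd[name])
    for name in baseline_silent
}

for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
```

Under this assumption the recomputed values agree with the reported percentages to within a few hundredths of a percentage point.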

The impact of supplementary materials on the retrieval of novel documents is shown in Table 2. For both benchmarks, the number of retrieved documents increased strongly: on average, using the supplementary material index more than doubled the number of documents retrieved (+132.57%). In the BRCA benchmark, an average of 8.23 documents per variant were retrieved from MEDLINE and PubMed Central, and the supplementary materials added an average of 9.64 new documents (+117.15%). In the ClinVar benchmark, an average of 1.26 documents per variant were retrieved from MEDLINE and PubMed Central, whereas the supplementary materials added an average of 2.69 new documents (+213.59%).

Table 2.

Impact of the supplementary materials to retrieve new articles compared to MEDLINE and PubMed Central

	BRCA	ClinVar	Total
Average number of documents retrieved in MEDLINE and PubMed Central (baseline)	8.23 (min: 0; max: 384)	1.26 (min: 0; max: 274)	4.36 (min: 0; max: 384)
Average number of documents retrieved in the SD	10.30 (min: 0; max: 74)	2.73 (min: 0; max: 94)	6.10 (min: 0; max: 94)
Average number of documents retrieved in the SD that are novel	9.64 (min: 0; max: 59)	2.69 (min: 0; max: 83)	5.78 (min: 0; max: 83)
Average number of documents retrieved in all collections (MEDLINE, PubMed Central and SD)	17.87 (min: 0; max: 440)	3.95 (min: 0; max: 357)	10.14 (min: 0; max: 440)
Relative gain of the SD compared to the baseline	+117.15%	+213.59%	+132.57%
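The relationship between the rows of Table 2 can be checked with a few lines of arithmetic. This is a sketch under two assumptions: that the relative gain is the average number of novel SD documents divided by the baseline average, and that the "all collections" row is the sum of the baseline and novel averages. The names used are illustrative.

```python
# Per-variant averages copied from Table 2.
baseline_avg = {"BRCA": 8.23, "ClinVar": 1.26, "Total": 4.36}
novel_sd_avg = {"BRCA": 9.64, "ClinVar": 2.69, "Total": 5.78}
all_collections_avg = {"BRCA": 17.87, "ClinVar": 3.95, "Total": 10.14}


def relative_gain(baseline: float, novel: float) -> float:
    """Gain brought by novel SD documents, in percent of the baseline.

    Assumption: gain = novel / baseline * 100.
    """
    return 100.0 * novel / baseline


gains = {
    name: relative_gain(baseline_avg[name], novel_sd_avg[name])
    for name in baseline_avg
}

for name, gain in gains.items():
    combined = baseline_avg[name] + novel_sd_avg[name]
    print(f"{name}: gain {gain:+.2f}%, baseline + novel = {combined:.2f}")
```

Because the inputs are already rounded to two decimals, the recomputed gains can differ from the reported ones by roughly a tenth of a percentage point.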

Precision-oriented analysis

First, we broadly assessed the relevance of the information found in the supplementary materials retrieved for the benchmark queries. We compared the clinical significance of the variants appearing only in the SD with that of the variants appearing in MEDLINE or PubMed Central (Table 3). Over the whole benchmark, the distribution of clinical significance values differed between the two sets [χ²(2, N = 1803) = 67.13, P < 0.01], but the difference mainly concerned the proportion of variants of unknown significance, which was higher in the SD. The relative proportions of pathogenic and benign variants did not differ significantly between variants found only in the SD and the others [χ²(1, N = 583) = 0.09, P = 0.76].

Table 3.

Frequency distribution of the clinical significance of the variants found only in the supplementary data compared to the variants retrieved in MEDLINE and PubMed Central

		BRCA	ClinVar	Total
Distribution of clinical significance for variants retrieved in MEDLINE and PubMed Central (N = 1232)	Pathogenic	16.64%	19.14%	17.78%
	Benign	28.68%	11.09%	20.70%
	Unknown significance	54.68%	69.77%	61.53%
Distribution of clinical significance for variants retrieved in the SD but not in MEDLINE and PubMed Central (N = 571)	Pathogenic	7.69%	8.62%	8.41%
	Benign	3.08%	12.93%	10.68%
	Unknown significance	89.23%	78.46%	80.91%
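The two chi-square tests reported in this section can be approximately reproduced from Table 3. The sketch below is an illustration, not the authors' analysis script. The counts are reconstructed by multiplying the Total percentages by the group sizes, which is an assumption because the raw counts are not given here, and applying Yates' continuity correction to the 2 × 2 comparison is also an assumption. The helper names `chi_square`, `p_value` and `counts` are hypothetical; in practice a library routine such as SciPy's `chi2_contingency` would normally be used.

```python
import math


def chi_square(table, yates=False):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Continuity correction, used here only for the 2 x 2 table.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value(stat, dof):
    """Upper-tail chi-square probability (closed forms for 1 and 2 degrees of freedom)."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2))
    if dof == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


def counts(percentages, n):
    """Rebuild integer counts from rounded percentages (an assumption)."""
    return [round(p * n / 100) for p in percentages]


# Total column of Table 3: pathogenic, benign, unknown significance.
literature = counts([17.78, 20.70, 61.53], 1232)
sd_only = counts([8.41, 10.68, 80.91], 571)

# All three classes (3 x 2 table), then pathogenic versus benign only (2 x 2 table).
stat_all, dof_all = chi_square([literature, sd_only])
stat_pb, dof_pb = chi_square([literature[:2], sd_only[:2]], yates=True)

print(f"all classes: chi2({dof_all}) = {stat_all:.2f}, P = {p_value(stat_all, dof_all):.3g}")
print(f"pathogenic vs benign: chi2({dof_pb}) = {stat_pb:.2f}, P = {p_value(stat_pb, dof_pb):.2f}")
```

The reconstructed pathogenic and benign counts sum to 583, which matches the sample size reported for the second test.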

We then evaluated the variants found in the supplementary documents for six queries, for a total of 20 documents in the BRCA benchmark and 21 documents in the ClinVar benchmark. We also assessed the type of information reported about the variants (Table 4). The gain in search effectiveness is consistent with the results of the previous section: 100% of the SD were previously unseen documents for the two BRCA variants, and 95% were unseen for the four ClinVar variants. BRCA variants were found in more supplementary documents, with an average of 19.5 retrieved documents per query, compared with an average of six documents for ClinVar variant queries. About two-thirds of the retrieved documents were correct, which provides an estimate of the precision of the search. The ClinVar benchmark yielded less accurate results than the BRCA benchmark. Each supplementary document contained hundreds of variants. These results are not discussed individually in the full text, either because they concern benign variants included as part of an evaluation benchmark, because the evaluation does not demonstrate any pathogenicity, or because they are discussed more generally in combination with other results. Regarding the type of information found in the SD, more than half were computational pathogenicity predictions, 22% reported allele frequencies in various populations and 17% were genome-wide association studies (GWAS). The less informative data, such as benchmarks for computational prediction, concerned variants assessed as benign.

Table 4.

Manual analysis of the documents retrieved in the supplementary materials

	BRCA	ClinVar	Total
Average percentage of novelty in the SD	100%	95%	98%
Average number (median) of documents retrieved in all collections (MEDLINE, PubMed Central and SD)	19.5 (min: 18; max: 21)	6 (min: 3; max: 8)	7.5 (min: 3; max: 21)
Average precision (median) for variants found in the SD	75% (min: 60%; max: 90%)	53% (min: 0%; max: 80%)	63% (min: 0%; max: 90%)
Information type found in the SD	Pathogenicity prediction and allele frequency in population	Pathogenicity prediction, allele frequency in population and GWAS	Pathogenicity prediction, allele frequency in population and GWAS

Discussion

The experiments reported in this paper constitute a first attempt to establish and quantify the importance of SD in supporting personalized health. Our results show that SD are a paramount source of content for characterizing the clinical actionability of sequence variants. However, some of the chosen experimental settings are likely to underestimate this importance. While the search for variants benefits from a powerful synonym generation engine, SynVar (14), which can associate many synonyms with a variant (e.g. BRAF:V600E, BRAF:Val600Glu and BRAF:1799T>A), the variability of gene (or gene product) names has not been similarly exploited. Using the synonyms of a gene or gene product (e.g. serine/threonine-protein kinase B-raf, BRAF1) could therefore have further increased the recall of our results; additional experiments would be needed to confirm this.
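To make the idea of combining gene synonyms with variant synonyms concrete, here is a toy sketch. It is not SynVar and does not reflect its implementation: it only converts a one-letter protein substitution into its three-letter form and pairs the result with a hand-written list of gene names. The functions `protein_variant_synonyms` and `expand_query` are hypothetical, and the gene synonym list is limited to the examples given above.

```python
import re
from itertools import product
from typing import List

# Standard one-letter to three-letter amino acid codes.
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def protein_variant_synonyms(variant: str) -> List[str]:
    """Return simple notational variants of a substitution such as 'V600E'.

    Anything that is not a one-letter substitution is returned unchanged.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if not match:
        return [variant]
    ref, pos, alt = match.groups()
    if ref not in AA3 or alt not in AA3:
        return [variant]
    three_letter = f"{AA3[ref]}{pos}{AA3[alt]}"
    return [variant, three_letter, f"p.{three_letter}"]


def expand_query(gene_names: List[str], variant: str) -> List[str]:
    """Pair every gene name with every variant synonym."""
    return [
        f"{gene}:{syn}"
        for gene, syn in product(gene_names, protein_variant_synonyms(variant))
    ]


# Gene names taken from the examples in the text above.
genes = ["BRAF", "BRAF1", "serine/threonine-protein kinase B-raf"]
queries = expand_query(genes, "V600E")
for query in queries:
    print(query)
```

A real expansion would also need genomic and transcript-level notations (such as 1799T>A), which cannot be derived from the protein change alone.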

While supplementary material represents an important source of information for curating variants, it also raises some challenges. First, supplementary documents often contain hundreds of variants from different genes, which increases the likelihood of matching a variant with the wrong gene. Second, the processing of SD, in particular content-based image recognition with optical character recognition (OCR), might generate normalization errors (e.g. L wrongly recognized as £). Thus, as reported by Wei et al. (20), some variants might simply not be recognized. Nevertheless, simple approaches or heuristics could be implemented to improve precision. For instance, it would be relatively straightforward to compute the positional distance (at word or offset level) between the gene and the variant. At the same time, improving precision should not be too detrimental to recall, especially for less studied variants for which very few publications exist. Ultimately, a user-controlled trade-off between recall and precision could provide the flexibility needed to focus interactively on precision for the few highly studied variants, while accommodating the need for broad recall for the overwhelming majority of sequenced variants.
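The positional-distance heuristic mentioned above could look roughly like the following word-level sketch. It was not part of the reported experiments; the function names, the punctuation handling and the default threshold of 10 tokens are all assumptions chosen for illustration.

```python
import re
from typing import List, Optional

PUNCTUATION = ".,;:()[]{}\"'"


def _positions(tokens: List[str], term: str) -> List[int]:
    """Indices of tokens equal to `term`, ignoring case and surrounding punctuation."""
    target = term.lower()
    return [
        i for i, tok in enumerate(tokens)
        if tok.strip(PUNCTUATION).lower() == target
    ]


def min_token_distance(text: str, gene: str, variant: str) -> Optional[int]:
    """Smallest word-level distance between a gene mention and a variant mention.

    Returns None when either term is absent from the text.
    """
    tokens = re.findall(r"\S+", text)
    gene_pos = _positions(tokens, gene)
    variant_pos = _positions(tokens, variant)
    if not gene_pos or not variant_pos:
        return None
    return min(abs(g - v) for g in gene_pos for v in variant_pos)


def keep_match(text: str, gene: str, variant: str, max_distance: int = 10) -> bool:
    """Accept a gene/variant match only if both occur within `max_distance` tokens.

    The threshold is a hypothetical knob that a user could tune to trade
    recall for precision.
    """
    distance = min_token_distance(text, gene, variant)
    return distance is not None and distance <= max_distance


row_text = "TP53 R175H benign ; KRAS G12D pathogenic ; BRAF V600E pathogenic"
print(min_token_distance(row_text, "BRAF", "V600E"))
print(keep_match(row_text, "TP53", "V600E", max_distance=5))
```

Exposing `max_distance` as a parameter is one way to offer the recall and precision trade-off described above: a small value favors precision for highly studied variants, while a large value or no filter preserves recall for rarely studied ones.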

Conclusion

Supplementary materials associated with publications play a critical role in any literature curation pipeline (21), and this seems especially true for the curation of genetic variants. In our experiments, most of the documents retrieved through the supplementary material collection were not found when searching the full text of the articles. SD contents more than double the number of documents retrieved per query, and significantly reduce the number of variants for which no articles are identified in the literature. They provide valuable information, including population studies and computational predictions, for assessing the pathogenicity of rarely studied variants or variants of unknown significance. Finally, with a reduction of silence of 63%, our results are consistent with, and stronger than, previous observations by Jimeno Yepes and Verspoor (7), who reported that about half of the published content about genetic variations is found exclusively in the supplementary materials. While FAIR is becoming a top priority on the agenda of global research infrastructures (e.g. the US National Library of Medicine, the European ELIXIR community or the Global BioData Coalition), the proper FAIRification of SD should receive more attention, in particular from research infrastructures maintaining literature search engines.

Data availability

The data used in this article are available under CC-BY 4.0 in Zenodo, at https://dx.doi.org/10.5281/zenodo.7661095 and https://dx.doi.org/10.5281/zenodo.7661195. The datasets were derived from sources in the public domain: ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) and LOVD (https://www.lovd.nl).

Conflict of interest statement

None declared.

Acknowledgments

The work has been supported by the CINECA project, which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825775. The study also leveraged the SIBiLS services, which are supported by the ELIXIR Data Platform, and benefited from discussions with Luana Licata, Livia Perfetto, Amos Bairoch, Charlotte Nachtegael and Tom Lenaerts. We would also like to thank Melissa Cline, University of California, Santa Cruz (UCSC), for providing some of the evaluation benchmarks. Finally, the work benefited from discussions with Valérie Barbié and Daniel Stekhoven, co-PIs of the SVIP SPHN project, as well as from the feedback and contributions of Anne Estreicher and Livia Famiglietti, biocurators at Swiss-Prot.

References

1. Tate J.G., Bamford, Jubb H.C. et al. (2019) COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res., D941–D947.

2. Chakravarty, Gao, Phillips S.M. et al. (2017) OncoKB: a precision oncology knowledge base. JCO Precis. Oncol., 2017.

3. Landrum M.J., Lee J.M., Benson et al. (2018) ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res., D1062–D1067.

4. Bateman, Martin M.-J. and Orchard; UniProt Consortium. (2021) UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res., D480–D489.

5. Li M.M., Datto, Duncavage E.J. et al. (2017) Standards and guidelines for the interpretation and reporting of sequence variants in cancer: a joint consensus recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. J. Mol. Diagn.

6. Richards, Aziz, Bale et al. (2015) Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med., 405–424.

7. Jimeno Yepes and Verspoor (2014) Literature mining of genetic variants for curation: quantifying the importance of supplementary material. Database (Oxford), 2014, bau003.

8. Naderi, Mottaz, Teodoro et al. (2022) Analyzing the information content of text-based files in supplementary materials of biomedical literature. Stud. Health Technol. Inform., 294, 876–877.

9. Cohen, Roberts, Gururaj A.E. et al. (2017) A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge. Database (Oxford), 2017, bax061.

10. Teodoro, Mottin, Gobeill et al. (2017) Improving average ranking precision in user searches for biomedical research datasets. Database (Oxford), 2017, bax083.

11. International Society for Biocuration. (2018) Biocuration: distilling data into knowledge. PLoS Biol., e2002846.

12. Howe, Costanzo, Fey et al. (2008) Big data: the future of biocuration. Nature, 455.

13. Pasche, Mottaz, Caucheteur et al. (2022) Variomes: a high recall search engine to support the curation of genomic variants. Bioinformatics (Oxford), 2595–2601.

14. Mottaz, Pasche, Michel P.A. et al. (2022) Designing an optimal expansion method to improve the recall of a genomic variant curation-support service. Stud. Health Technol. Inform., 294, 839–843.

15. Gobeill, Caucheteur, Michel P.A. et al. (2020) SIB Literature Services: RESTful customizable search engines in biomedical literature, enriched with automatically mapped biomedical concepts. Nucleic Acids Res., W12–W16.

16. Smith (2007) An overview of the Tesseract OCR Engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007). Curitiba, Brazil, pp. 629–633.

17. Cline M.S., Liao R.G., Parsons M.T. et al. (2018) BRCA challenge: BRCA exchange as a global resource for variants in BRCA1 and BRCA2. PLoS Genet., e1007752.

18. Fokkema I.F., Taschner P.E., Schaafsma G.C. et al. (2011) LOVD v.2.0: the next generation in gene variant databases. Hum. Mutat., 557–563.

19. Virtanen, Gommers, Oliphant T.E. et al. (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods, 261–272.

20. Wei C.H., Phan, Feltz et al. (2018) tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics.

21. Kafkas Ş., Kim J.H. et al. (2015) Database citation in supplementary data linked to Europe PubMed Central full text biomedical articles. J. Biomed. Semantics, 1.