TRSRD: a database for research on risky substances in tea using natural language processing and knowledge graph-based techniques

Data collection

First, the PMIDs of all relevant literature were obtained from PubMed using keywords such as tea, pesticide, and food safety. The PMIDs were then used with Biopython (24) to retrieve each paper’s title, DOI, keywords, and abstract from the PubMed database. At this stage, papers with neither keywords nor an abstract were removed from the dataset. For the remaining papers that had an abstract but no keywords, KeyBERT (25) was used to generate six keywords, resulting in 5177 usable records. Finally, the initial data were obtained by screening the titles, keywords, and abstracts and deleting papers irrelevant to tea. Selecting only the relevant information at this stage facilitates the later summarization and classification steps by keeping useless information out of them (26).
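The retrieval and filtering steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the record layout (`pmid`, `title`, `doi`, `keywords`, `abstract`), and the `retmax` default are hypothetical, and the rule "drop a paper only when it has neither keywords nor an abstract" is one reading of the text. The Biopython (`Entrez`, `Medline`) and KeyBERT (`extract_keywords`) calls are imported lazily so that the filtering logic can be read and run on its own.

```python
from typing import Callable, Iterable, List


def fetch_medline_records(query: str, email: str, retmax: int = 1000) -> List[dict]:
    """Search PubMed and fetch MEDLINE records (needs Biopython and network access)."""
    from Bio import Entrez, Medline  # imported lazily; only live retrieval needs it

    Entrez.email = email  # NCBI asks API users to identify themselves
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    pmids = Entrez.read(handle)["IdList"]
    handle.close()
    if not pmids:
        return []
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids),
                           rettype="medline", retmode="text")
    records = list(Medline.parse(handle))
    handle.close()
    return records


def to_paper(record: dict) -> dict:
    """Map MEDLINE tags (PMID, TI, AID, OT, AB) to the hypothetical record layout."""
    doi = next((aid.split(" ")[0] for aid in record.get("AID", [])
                if aid.endswith("[doi]")), "")
    return {
        "pmid": record.get("PMID", ""),
        "title": record.get("TI", ""),
        "doi": doi,
        "keywords": list(record.get("OT", [])),  # OT holds author-supplied keywords
        "abstract": record.get("AB", ""),
    }


def keybert_keywords(abstract: str, top_n: int = 6) -> List[str]:
    """Generate keywords with KeyBERT; a real run would reuse one model instance."""
    from keybert import KeyBERT  # imported lazily; heavy dependency

    model = KeyBERT()
    return [kw for kw, _score in model.extract_keywords(abstract, top_n=top_n)]


def triage(papers: Iterable[dict],
           make_keywords: Callable[[str], List[str]]) -> List[dict]:
    """Drop papers with neither keywords nor abstract; fill in missing keywords."""
    kept = []
    for paper in papers:
        abstract = (paper.get("abstract") or "").strip()
        keywords = [k for k in (paper.get("keywords") or []) if k.strip()]
        if not abstract and not keywords:
            continue  # nothing to classify this paper on
        if not keywords:
            keywords = make_keywords(abstract)  # abstract is present here
        kept.append({**paper, "abstract": abstract, "keywords": keywords})
    return kept
```

In a live run, `triage(map(to_paper, fetch_medline_records(...)), keybert_keywords)` would chain the three steps; passing the keyword generator as an argument keeps the filter independent of any particular model.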

Text summarization and classification

After obtaining the initial filtered data, the long abstracts were compressed with Google’s pegasus-large model (27), which performs paragraph comprehension, information compression (28), and language generation (29), to produce a summary of only a few sentences. The purpose of this step is to compress the long abstracts into short, fluent, readable texts that retain the most salient information (30). The title, keywords, abstract, and summary were then used together to classify each paper along two dimensions: the risk substance category (pollutants (comprehensive discussion), heavy metals, pesticides, environmental pollutants, mycotoxins, microorganisms, radioactive isotopes, plant growth regulators, and others) and the research category (review, safety evaluation/risk assessment, prevention and control measures, detection methods, residual/pollution situation, and data analysis/data measurement). If the keywords and abstract of a paper referred to a single risk substance category, the paper was assigned directly to that category; if they referred to more than one, it was classified as pollutants (comprehensive discussion). As there is currently no classification model for tea research, the remaining papers were manually classified, and those that were not manually classified were checked twice at a later stage to ensure correctness. Table 1 shows the risk substance categories and their classification criteria. Table 2 presents the classification criteria for the research categories.
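As a rough illustration of the summarization step, the sketch below wraps a Hugging Face `transformers` summarization pipeline loaded with the `google/pegasus-large` checkpoint. The function names, the `max_length`/`min_length` values, and the word-count threshold for what counts as a "long" abstract are assumptions made for the example; the paper does not state them.

```python
from typing import Callable, Iterable, List


def load_pegasus(model_name: str = "google/pegasus-large") -> Callable[[str], str]:
    """Return a text -> summary function backed by a transformers pipeline."""
    from transformers import pipeline  # imported lazily; heavy dependency

    summarizer = pipeline("summarization", model=model_name)

    def summarize(text: str) -> str:
        # Length limits are illustrative assumptions, not values from the paper.
        result = summarizer(text, max_length=64, min_length=16, truncation=True)
        return result[0]["summary_text"]

    return summarize


def add_summaries(papers: Iterable[dict],
                  summarize: Callable[[str], str],
                  min_words: int = 60) -> List[dict]:
    """Attach a 'summary' field; short abstracts are kept verbatim (assumed rule)."""
    out = []
    for paper in papers:
        abstract = paper["abstract"]
        is_long = len(abstract.split()) >= min_words
        summary = summarize(abstract) if is_long else abstract
        out.append({**paper, "summary": summary})  # original dict is not mutated
    return out
```

A call such as `add_summaries(papers, load_pegasus())` would download the model on first use; injecting the summarizer as a parameter also makes it easy to swap in another checkpoint.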

Table 1.

Risk substance categories and their classification criteria

Risk substance category	Classification criteria
Pollutants (comprehensive discussion)	This item is used when two or more different pollutants are included.
Heavy metals	This item is used when only heavy metals are contained.
Pesticides	This item is used when only pesticides or insecticides are contained.
Environmental pollutants	This item is used when only environmental pollutants are contained.
Mycotoxins	This item is used when only mycotoxins are contained.
Microorganisms	This item is used when only microorganisms are contained.
Radioactive isotopes	This item is used when only radioactive isotopes are contained.
Plant growth regulators	This item is used when only plant growth regulators are contained.
Others	This item is used when none of the above categories applies.
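The single-category rule in Table 1 can be expressed as a small decision function. The sketch below is only an illustration: the term lists are hypothetical and deliberately tiny, the assignment of example terms to categories is an assumption, and returning `None` for papers that match nothing reflects the manual classification described in the text rather than an automatic "Others" label.

```python
import re
from typing import Iterable, List, Optional

COMPREHENSIVE = "Pollutants (comprehensive discussion)"

# Hypothetical example terms per category; a real lexicon would be far larger.
CATEGORY_TERMS = {
    "Heavy metals": ["heavy metal", "cadmium", "arsenic"],
    "Pesticides": ["pesticide", "insecticide"],
    "Environmental pollutants": ["perchlorate", "anthraquinone"],
    "Mycotoxins": ["mycotoxin", "aflatoxin"],
    "Microorganisms": ["bacteria", "salmonella"],
    "Radioactive isotopes": ["radionuclide", "radioactive isotope"],
    "Plant growth regulators": ["plant growth regulator", "gibberellin"],
}


def matched_categories(text: str) -> List[str]:
    """Return every category with at least one whole-word term match."""
    lowered = text.lower()
    hits = []
    for category, terms in CATEGORY_TERMS.items():
        # Allow an optional plural "s" after each term.
        if any(re.search(r"\b" + re.escape(term) + r"s?\b", lowered)
               for term in terms):
            hits.append(category)
    return hits


def classify_substance(keywords: Iterable[str], abstract: str) -> Optional[str]:
    """Apply the single-category rule to the keywords and abstract of one paper."""
    hits = matched_categories(" ".join(keywords) + " " + abstract)
    if len(hits) == 1:
        return hits[0]        # exactly one category: assign it directly
    if len(hits) >= 2:
        return COMPREHENSIVE  # two or more different pollutants
    return None               # no match: leave for manual classification
```

For example, a paper whose text mentions both aflatoxin and pesticide residues falls into the comprehensive category, while one that mentions only cadmium is assigned to heavy metals.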


Table 2.

Research categories of papers and their classification criteria

Research category	Classification criteria
Review	Summary overview of hazardous substances
Safety evaluation/risk assessment	Assess the risk of the hazardous substance or study its harm to humans
Prevention and control measures	Reduce the residue of the hazardous substance or avoid the use of the hazardous substance
Detection methods	Innovative detection methods for this hazardous substance
Residual/pollution situation	Study the residual pattern of the pollutant or investigate the pollution of a certain area by the pollutant
Data analysis/data measurement	Large-scale data collection and analysis of pollutants at a site


Syntactic analysis and entity extraction

After classifying the papers, we identified the chemical substances they mention in two steps: first, the syntactic structure of the sentences in each abstract was analyzed (31), and then meaningful information was identified from the parsed structure by named entity recognition (32). With the development of natural language processing, models trained on domain-specific datasets are now often used in the biomedical field. Biomedical natural language processing is commonly applied to word sense disambiguation, named entity recognition, information extraction, and relation extraction (33).

To ensure the accuracy of named entity recognition, we chose Stanza (34), a neural natural language processing package that provides models customized for biomedical and clinical text, and used its biomedical and clinical syntactic analysis and named entity recognition models. For the classified abstracts, we chose a syntactic analysis pipeline trained on the MIMIC clinical dataset (35) and a named entity recognition model pre-trained on the BC4CHEMD corpus (36). After the chemical entities were extracted, they were grouped by the paper in which they occur to build the database.
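The extraction step could look roughly like the sketch below, which loads a Stanza pipeline with the `mimic` package and the `bc4chemd` NER model and then groups chemical mentions by paper. The function names and the case-insensitive de-duplication are assumptions made for the example; the `CHEMICAL` label is the entity type the BC4CHEMD model emits.

```python
from typing import Dict, List, Tuple


def build_chemical_pipeline():
    """Load Stanza with MIMIC syntactic models and the BC4CHEMD NER model."""
    import stanza  # imported lazily; downloads models on first use

    stanza.download("en", package="mimic", processors={"ner": "bc4chemd"})
    return stanza.Pipeline("en", package="mimic", processors={"ner": "bc4chemd"})


def extract_mentions(nlp, abstract: str) -> List[Tuple[str, str]]:
    """Return (text, type) for every entity Stanza finds in one abstract."""
    doc = nlp(abstract)
    return [(ent.text, ent.type) for ent in doc.entities]


def group_chemicals(
    mentions_by_pmid: Dict[str, List[Tuple[str, str]]]
) -> Dict[str, List[str]]:
    """Keep CHEMICAL entities only, de-duplicated per paper (case-insensitive)."""
    grouped = {}
    for pmid, mentions in mentions_by_pmid.items():
        seen = {}
        for text, label in mentions:
            if label != "CHEMICAL":
                continue
            key = " ".join(text.lower().split())  # normalize case and whitespace
            seen.setdefault(key, text.strip())    # keep the first surface form
        grouped[pmid] = list(seen.values())
    return grouped
```

Because recognition is imperfect, the grouped output would still need the manual screening that the authors mention.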

Build database and website

For database selection, we chose the Neo4j graph database in order to store the connections and relationships between data elements more flexibly. In Neo4j, the data are stored much as they would be sketched on a whiteboard, which makes Neo4j flexible compared to other graph databases. In addition to these advantages, the Neo4j graph database also offers better performance (37).

Finally, these data were aggregated into nodes and relationships in the Neo4j graph database to create the Tea Risk Substance Research Database (TRSRD). TRSRD allows researchers to easily access and analyze the data we have collected and processed.
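One way to turn a classified paper into nodes and relationships is to generate parameterized Cypher `MERGE` statements and run them through the official `neo4j` Python driver. The sketch below is an assumption-laden illustration: the labels (`Paper`, `RiskCategory`, `Substance`), the relationship types (`BELONGS_TO`, `MENTIONS`), and the record fields are hypothetical and are not taken from the TRSRD schema.

```python
from typing import Dict, Iterable, List, Tuple

Statement = Tuple[str, Dict[str, str]]


def paper_statements(paper: dict) -> List[Statement]:
    """Build (cypher, params) pairs for one paper; MERGE keeps the load idempotent."""
    statements: List[Statement] = [
        ("MERGE (p:Paper {pmid: $pmid}) SET p.title = $title",
         {"pmid": paper["pmid"], "title": paper["title"]}),
        ("MATCH (p:Paper {pmid: $pmid}) "
         "MERGE (c:RiskCategory {name: $category}) "
         "MERGE (p)-[:BELONGS_TO]->(c)",
         {"pmid": paper["pmid"], "category": paper["category"]}),
    ]
    for name in paper.get("substances", []):
        statements.append(
            ("MATCH (p:Paper {pmid: $pmid}) "
             "MERGE (s:Substance {name: $name}) "
             "MERGE (p)-[:MENTIONS]->(s)",
             {"pmid": paper["pmid"], "name": name}))
    return statements


def load_papers(uri: str, user: str, password: str, papers: Iterable[dict]) -> None:
    """Run the generated statements against a Neo4j server."""
    from neo4j import GraphDatabase  # imported lazily; needs the neo4j package

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for paper in papers:
                for query, params in paper_statements(paper):
                    session.run(query, params)
    finally:
        driver.close()
```

Passing values as query parameters rather than formatting them into the Cypher string avoids quoting problems with substance names that contain punctuation.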

Results

Knowledge mapping for tea risk substance research

With 4189 nodes and 9400 associations, TRSRD divides all the literature into nine risk substance categories and six research categories and extracts 955 different tea risk substances from the classified papers to build a knowledge graph. This allows researchers to quickly visualize the different hazardous substances in tea, and the studies on them, without having to do extensive searching and summarizing in the field, which helps in the referencing and citation of research. A complete visualization of the data in the database is shown in Figure 2.

Figure 2.

Overview of all node relationships in the Neo4j database.

Neo4j database management website

The developers mainly use this site to maintain and check the data; when the data need to be updated, new data can be redeployed from this site. It is also possible to view a visualization of all the data in the database, as shown in Figure 3.

Figure 3.

Neo4j database management site showcase.

Website (TRSRD) page

The TRSRD website provides a friendly and intuitive interactive interface that allows users to browse the site’s introduction, search the Neo4j database (with results returned visually), view data (in tabular form), and download all data. We built the site with separate front and back ends: Spring Boot connects to the Neo4j database as the back end of the TRSRD site, and native HyperText Markup Language with a JavaScript framework forms the front end. Figure 4 shows the pages of the TRSRD.

Figure 4.

All pages of the TRSRD are displayed. (A) The homepage of the website provides a brief introduction of the website content and presents data statistics. (B) The knowledge graph search page of the website allows for visual exploration of the data. (C) The data browsing page of the website enables viewing and searching of the data in tabular format. (D) The download page of the website allows for direct downloading of the database data. (E) The about page of the website lists the data sources and other related database links.

Website main page

The website’s main page shows an introduction and overview of TRSRD, while the database statistics on the left-hand side show the distribution of risk substance data and research papers in the database and are presented in a Nightingale rose diagram (Figure 4A).

Search and browse page

The search page enables users to enter specific keywords and obtain visual search results, facilitating intuitive data retrieval and querying. Clicking on a node displays the <id> of the node and its corresponding value (Figure 4B). The browse page displays all the data in table format and supports searching for specified keywords (Figure 4C). On the browse page, users can view all currently included paper data (sorted by publication date), the extracted risk substance data, and the Maximum Residue Levels standards specified by different countries or organizations.

Download and About page

The download page lets users download the data in the format they need: Neo4j database import format, CSV table format, or JavaScript Object Notation format (Figure 4D). On the About page, users can view links to research and data related to the site and links to the web development technologies used, each of which can be opened by clicking on it. In addition, our contact email is given in the contact section at the bottom of the page (Figure 4E).
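For readers who want to reproduce the CSV and JSON export formats from their own records, a minimal standard-library sketch follows. The field names and the choice to join multiple substances with "; " in the CSV are assumptions made for the example, not a description of the files TRSRD actually serves.

```python
import csv
import io
import json
from typing import Iterable, List


def to_csv(rows: Iterable[dict], fields: List[str]) -> str:
    """Serialize records to CSV text, flattening the substance list into one cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        flat = {**row, "substances": "; ".join(row.get("substances", []))}
        writer.writerow(flat)
    return buffer.getvalue()


def to_json(rows: Iterable[dict]) -> str:
    """Serialize records to JSON, keeping non-ASCII characters readable."""
    return json.dumps(list(rows), ensure_ascii=False, indent=2)
```

The CSV writer quotes any field that contains a comma, so titles with commas survive a round trip through spreadsheet software.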

Conclusion

As tea becomes increasingly popular in various countries, more and more people are becoming aware of it. At the same time, research into the safety of tea has gradually increased. However, papers on tea quality and safety have never been systematically integrated. As a result, researchers cannot quickly and easily search for harmful substances in their desired field. In this paper, we have constructed a database centered on tea risk substance research by filtering and classifying existing tea research data through natural language processing and performing named entity extraction. Researchers can use TRSRD to better understand the risk substances in tea and the corresponding research, which is an essential reference for exploring the formation of risk substances in tea and future safety standards for tea.

As tea research continues to develop, we will continue to deepen our understanding of the harmful substances in tea and help improve its safety standards. Our current work still has shortcomings: for example, the classification of chemical substances is not precise enough and does not yet provide a more detailed classification covering the whole industry chain. In addition, the model used for named entity recognition does not achieve perfect identification, so manual screening of the data is still required. In the future, we will continue to monitor the latest research results in this field, incorporate more hazardous substances into the TRSRD, enrich the classification in the TRSRD, improve the classification criteria, and make the TRSRD a reference and citation platform for tea researchers.

Data availability

All required data are contained in the database website and are available for download by all users. The database website address is http://trsrd.wpengxs.cn.

Funding

Research Projects of Anhui Higher Education Institutions (Natural Science, 2022AH040122).

Conflict of interest statement

None declared.

References

1. Khan and Mukhtar (2013) Tea and health: studies in humans. Curr. Pharm. Des., 6141–6147.

2. Zhai, Zhang, Granvogl et al. (2022) Flavor of tea (Camellia sinensis): a review on odorants and analytical techniques. Compr. Rev. Food Sci. Food Saf., 3867–3909.

3. Graham H.N. (1992) Green tea composition, consumption, and polyphenol chemistry. Prev. Med., 334–350.

4. Khan and Mukhtar (2019) Tea polyphenols in promotion of human health. Nutrients, 39.

5. Saeed, Naveed, Arif et al. (2017) Green tea (Camellia sinensis) and l-theanine: medicinal values and beneficial applications in humans—a comprehensive review. Biomed. Pharmacother., 1260–1275.

6. Yang C.S., Zhang et al. (2016) Mechanisms of body weight reduction and metabolic syndrome alleviation by tea. Mol. Nutr. Food Res., 160–174.

7. Suzuki, Pervin, Goto et al. (2016) Beneficial effects of tea and the green tea catechin epigallocatechin-3-gallate on obesity. Molecules, 1305.

8. Yang C.S., Wang, G.X. et al. (2011) Cancer prevention by tea: evidence from laboratory studies. Pharmacol. Res., 113–122.

9. Zhang et al. (2021) The neuroprotective effect of tea polyphenols on the regulation of intestinal flora. Molecules, 3692.

10. Chung, Zhao, Wang et al. (2020) Dose–response relation between tea consumption and risk of cardiovascular disease and all-cause mortality: a systematic review and meta-analysis of population-based studies. Adv. Nutr., 790–814.

11. Wei, Huang and Yang (2012) The impacts of food safety standards on China’s tea exports. China Econ. Rev., 253–264.

12. Chen and Liu X.R. (2016) Analysis of Tea Pesticide Residue Standards and Testing Methods. Atlantis Press, Amsterdam, pp. 876–879.

13. Gurusubramanian, Rahman, Sarmah et al. (2008) Pesticide usage pattern in tea ecosystem, their retrospects and alternative measures. J. Environ. Biol., 813–826.

14. E.H., Huang S.Z., T.H. et al. (2020) Systematic probabilistic risk assessment of pesticide residues in tea leaves. Chemosphere, 247, 125692.

15. Zhang, Yang, Chen et al. (2018) Accumulation of heavy metals in tea leaves and potential health risk assessment: a case study from Puan County, Guizhou Province, China. Int. J. Environ. Res. Public Health, 133.

16. Abd El-Aty A.M., Choi J.H., Rahman et al. (2014) Residues and contaminants in tea and tea infusions: a review. Food Addit. Contam. A, 1794–1804.

17. Cladière, Delaporte, Le Roux et al. (2018) Multi-class analysis for simultaneous determination of pesticides, mycotoxins, process-induced toxicants and packaging contaminants in tea. Food Chem., 242, 113–121.

18. Wang, Zhou, Luo et al. (2018) 9,10-Anthraquinone deposit in tea plantation might be one of the reasons for contamination in tea. Food Chem., 244, 254–259.

19. Liao, Cao and Gao (2022) Monitoring and risk assessment of perchlorate in tea samples produced in China. Food Res. Int., 157, 111435.

20. Cohen K.B. and Hunter (2008) Getting started in text mining. PLoS Comput. Biol., e20.

21. Minaee, Kalchbrenner, Cambria et al. (2021) Deep learning-based text classification: a comprehensive review. ACM Comput. Surv., 62:1–62:40.

22. Auer, Kovtun, Prinz et al. (2018) Towards a knowledge graph for science. In: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics. WIMS ’18, Association for Computing Machinery, New York.

23. Kim, Song et al. (2020) Building a PubMed knowledge graph. Sci Data, 205.

24. Cock P.J.A., Antao, Chang J.T. et al. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 1422–1423.

25. Grootendorst, Mishra, Matsak et al. MaartenGr/KeyBERT: v0.7.0. https://maartengr.github.io/KeyBERT.

26. Ananiadou and McNaught (2006) Text Mining for Biology and Biomedicine. Artech House, Inc. https://research.manchester.ac.uk/en/publications/text-mining-for-biology-and-biomedicine (30 December 2022, date last accessed).

27. Zhang, Zhao, Saleh et al. (2020) PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In: Proceedings of the 37th International Conference on Machine Learning. PMLR, pp. 11328–11339.

28. Wolff J.G. (1993) Computing, cognition and information compression. AI Commun., 107–127.

29. Gatt and Krahmer (2018) Survey of the state of the art in natural language generation: core tasks, applications and evaluation. J. Artif. Intell. Res., –170.

30. Fabbri A.R., Kryściński, McCann et al. (2021) SummEval: re-evaluating summarization evaluation. Trans. Assoc. Comput. Linguist., 391–409.

31. Sager (1967) Syntactic analysis of natural language. In: Alt, Rubinoff (eds), Advances in Computers. Elsevier, Amsterdam, pp. 153–188.

32. Goyal, Gupta and Kumar (2018) Recent named entity recognition and classification techniques: a systematic review. Comput. Sci. Rev.

33. Houssein E.H., Mohamed R.E. and Ali A.A. (2021) Machine learning techniques for biomedical natural language processing: a comprehensive review. IEEE Access, 140628–140653.

34. Zhang et al. (2021) Biomedical and clinical English model packages for the Stanza Python NLP library. J. Am. Med. Inform. Assoc., 1892–1899.

35. Johnson A.E.W., Pollard T.J., Shen et al. (2016) MIMIC-III, a freely accessible critical care database. Sci Data, 160035.

36. Krallinger, Rabal, Leitner et al. (2015) The CHEMDNER corpus of chemicals and drugs and its annotation principles. J. Cheminform, S2.