Abstract

Information regarding the physical interactions among proteins is crucial, since protein–protein interactions (PPIs) are central to many biological processes. The experimental techniques used to verify PPIs are vital for characterizing and assessing the reliability of the identified PPIs. Much of the information about PPIs and the experimental methods is available only in the text of the scientific publications that report them. In this study, we approach the problem of identifying passages with experimental methods for physical interactions between proteins as an information retrieval search task. The baseline system is based on query matching, where the queries are generated by utilizing the names (including synonyms) of the experimental methods in the Proteomics Standard Initiative–Molecular Interactions (PSI-MI) ontology. We propose two methods in which the baseline queries are expanded with additional relevant terms. The first method is a supervised approach, where the most salient terms for each experimental method are obtained by using the term frequency–relevance frequency (tf.rf) metric over 13 articles from our manually annotated data set of 30 full text articles, which is made publicly available. The second method is an unsupervised approach, where the queries for each experimental method are expanded by using the word embeddings of the names of the experimental methods in the PSI-MI ontology. The word embeddings are obtained from a large unlabeled full text corpus. The proposed methods are evaluated on the test set consisting of 17 articles. Both methods obtain higher recall scores compared with the baseline, at the cost of some precision. Besides higher recall, the word embeddings based approach achieves a higher F-measure than the baseline and the tf.rf based methods. We also show that incorporating gene name and interaction keyword identification leads to improved precision and F-measure scores for all three evaluated methods. The tf.rf based approach was developed as part of our participation in the Collaborative Biocurator Assistant Task of the BioCreative V challenge assessment, whereas the word embeddings based approach is a novel contribution of this article.

Database URL: https://github.com/ferhtaydn/biocemid/

Introduction

The functions of proteins are often modulated through their interactions with other proteins. Protein–protein interactions (PPIs) play important roles in many biological processes including cell cycle control, DNA replication, translation, transcription and metabolic and signaling pathways (1). A number of databases such as BioGrid (2), IntAct (3), DIP (4), MINT (5) and BIND (6) have been developed to store PPI information in a well-structured format in order to facilitate data retrieval and systematic analysis. The PPI information in these databases is extracted manually by human curators from the published literature. However, manual curation is a laborious and time consuming task. Therefore, it is only able to handle a small fraction of the rapidly growing biomedical literature (7). In order to address this challenge, several text-mining studies have been conducted for automatically extracting information from published articles. Community-wide shared tasks such as BioCreative (8–10) and BioNLP (11–13) have played important roles in promoting research in this area. Being one of the main tasks in these community-wide efforts, extracting interactions among proteins has gained significant attention from researchers. Although improvements have been obtained in extracting PPIs from text in recent years (14, 15), enriching PPIs with context information, including the experimental methods used to detect the PPIs, has not been well studied yet (16). Various experimental methods such as ‘affinity capture’, ‘two-hybrid’ and ‘coimmunoprecipitation’ are available for detecting protein interactions (1). Experimental methods have different degrees of resolution, confidence and reliability. Therefore, besides the existence of an interaction between a pair of proteins, the experimental conditions in which this interaction was observed are also very important for the interpretation and assessment of the interaction (16).

The problem of identifying the experimental methods used to detect a given PPI in an article was tackled by the Interaction Method Subtask (IMS) of the BioCreative II challenge (14). Two teams participated in the sub-task (17, 18). Rinaldi et al. (17) obtained promising results by using mostly manually crafted patterns for matching the experimental method terms in the provided ontology against the full text article including the PPI. Ehrler et al. (18) used a pattern matching and vector space retrieval based model. A similar task, namely the Interaction Method Task (IMT), was also addressed at the BioCreative III challenge (8, 19). The goal in the IMT task at BioCreative III was to identify the experimental methods in a given full text article and map them to the interaction detection method terms in the PSI-MI ontology (20, 21). The positions of the experimental methods in the articles were not required to be identified.

Most previous studies on experimental method detection, including the ones in the BioCreative challenges, used pattern matching and/or machine learning based approaches. In the pattern matching based approach, the experimental method names in text are matched against the terms in a lexicon or ontology, such as the PSI-MI ontology, usually by using hand-crafted patterns (22–25). Pattern matching based methods are able to identify the positions of the experimental method mentions in the articles. However, they fail to identify the experimental methods when they occur in forms that do not match the designed patterns. In order to handle approximate matches, Matos et al. developed an Information Retrieval based system, where the test documents are indexed and searched for experimental methods using the Lucene search library (26). In the machine learning based approach, the task of experimental method detection is in general formulated as a text classification task, where the classes are defined as the experimental methods and the goal is to classify the articles into zero or more of these classes. Machine learning based methods obtained promising results in the BioCreative III challenge, where different classification algorithms such as Naive Bayes (27), Random Forest (28), Support Vector Machines (29) and Logistic Regression (29) were utilized. Machine learning based methods classify articles as containing a certain experimental method or not, so experimental methods can be detected even if they do not occur under their standard names or synonyms. However, the positions of the experimental methods in the articles are not identified.

In this article, we approach the problem of experimental method detection as a passage retrieval task. We target identifying passages (i.e. sequences of sentences) where certain experimental methods are described. In many cases, experimental method descriptions span multiple sentences. Passage-level retrieval is especially crucial for articles in which multiple PPIs and experimental methods are mentioned, since it can help map PPIs to their corresponding experimental methods. For instance, consider the sample text from (30) shown in Figure 1. The text describes three experimental methods used to identify the proteins interacting with the ‘TANK-binding kinase 1’ (TBK1) protein. The passages describing the experimental methods ‘tandem affinity purification’ (MI:0676), ‘mass spectrometry studies of complexes’ (MI:0069) and ‘coimmunoprecipitation’ (MI:0019) are highlighted with yellow, purple and green, respectively. Different PPIs were observed by using these three experimental methods. For example, the passage about the ‘coimmunoprecipitation’ experiment (shown in green) states that no interactions were observed between the protein pairs TBK1-DDX3X, TBK1-IRF3 and DDX3X-IRF3 by using the ‘coimmunoprecipitation’ experimental technique. This example illustrates that experimental method descriptions may span multiple sentences. In addition, it demonstrates that identifying the passages describing the experimental methods is important for resolving which method detected which of the PPIs described in the text.

Figure 1.

Sample text with multiple PPIs and experimental methods taken from the Results section of (30). The text describes three experimental interaction detection methods used to identify the proteins interacting with the ‘TBK1’ protein. The passages describing the experimental interaction detection methods ‘tandem affinity purification’ (MI:0676), ‘mass spectrometry studies of complexes’ (MI:0069), and ‘coimmunoprecipitation’ (MI:0019) are highlighted with yellow, purple and green, respectively.

We describe two query matching approaches for retrieving passages related to physical PPI detection methods from articles. The first approach is based on generating queries using the term frequency–relevance frequency (tf.rf) metric and was developed as part of our participation in the BioCreative V BioC Track (31). The aim of the BioC track was to develop BioC-compatible (32) modules integrated together to form a text-mining system to assist biocurators (33, 34). Our second approach is based on generating queries by using the word embeddings of the experimental method names (i.e. the canonical name and synonyms) in the PSI-MI ontology (The ontology is available at http://www.ebi.ac.uk/ols/beta/ontologies/mi) (PSI-MI, Version: 2.5, RRID:SCR_010710). We obtained the word embeddings by using the ‘word2vec’ (Word2vec Tool: http://word2vec.googlecode.com/; Revision-42:http://word2vec.googlecode.com/svn/trunk/) tool (word2vec, Version: Revision 42, RRID:SCR_014776) (35), which is an efficient implementation of neural networks based learning techniques for constructing word vectors from large unlabeled data sets with billions of words (36). As an additional contribution of this study, a data set consisting of 30 full text articles is manually annotated for passages describing experimental methods and made publicly available.

Materials and methods

Data set

To the best of our knowledge, there does not exist a data set annotated for experimental interaction detection methods (with MI ontology identifiers) at the passage level (i.e. with the exact location in the article). The available data sets for experimental methods are annotated at the article level (e.g. the BioCreative II IMS and BioCreative III IMT data sets (BioCreative, RRID:SCR_006311) (14, 19)); in other words, only the list of experimental methods for each article is provided. Therefore, we manually annotated a data set of full text articles at the passage level by selecting a subset of the BioCreative III IMT task data set. The subset of articles was selected according to the availability of the articles in ‘PMC Open Access’ (http://www.ncbi.nlm.nih.gov/pmc/) (PubMed Central, RRID:SCR_004166) (37) as full text, as well as their availability in BioC format. Thirty articles from this subset were randomly selected and annotated by two annotators with natural language processing and information retrieval backgrounds for passages (i.e. sequences of sentences) that describe an experimental method as evidence for a physical PPI, together with the specific method that each passage describes. The disagreements between the two annotators were resolved collaboratively. Then, the annotations of the test set consisting of 17 articles were checked, validated and corrected whenever necessary by a domain expert. These final annotations were used as the gold standard. The Inter Annotator Agreement (IAA) over the test set is computed by comparing the combined annotations of the two annotators (after resolving the disagreements between them) against the gold standard test set checked by the domain expert. The evaluation approach described in the ‘Evaluation’ section is used to measure IAA precision, recall and F-measure (38), which are computed as 0.787, 0.937 and 0.856, respectively. The annotated data set is publicly available (https://github.com/ferhtaydn/biocemid/tree/odj/files/published_dataset) (Biocemid, RRID:SCR_014779).

The data set of 30 articles is split into two parts, where the first part comprises 13 articles and is used as training set in the tf.rf based approach and as validation set in the word embeddings based approach. The remaining 17 articles, which were checked and validated by the domain expert, are used as test set for all methods developed in this study. The total number of annotated passages, the total number of paragraphs which have at least one annotated passage, and the total number of paragraphs which do not contain any annotated passages in the data set of 30 articles are 370, 292 and 1194, respectively.

A sample annotation from a paragraph of an article in the data set is shown in Figure 2. Each annotation has an identifier that is incremented by one throughout the article and two infons, which store key-value pairs with any required information in the context (32). The value of the ‘type’ infon is set to ‘ExperimentalMethod’ for all annotations and the value of the ‘PSIMI’ infon is set to the PSI-MI identifier of the interaction detection method. The ‘text’ tag holds the annotated sentence(s). The ‘location’ tag holds the position of the annotated portion in the article with the ‘offset’ and ‘length’ attributes. As illustrated in Figure 2, different passages (sequences of sentences) in a paragraph can be annotated with different experimental methods. It is also possible that multiple experimental methods are explained in the same passage of a paragraph. In this case, the corresponding passage of the paragraph is annotated with each experimental method separately. If a paragraph comprises a continuous and coherent explanation of one experimental method, then the whole paragraph is annotated with that method only.
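For concreteness, the annotation fields described above can be sketched as a simple record. The field names mirror the infons and tags in Figure 2, while the identifier, PSI-MI id, text and offsets below are hypothetical values, not taken from the data set.

```python
# A minimal sketch of one passage-level annotation (values are hypothetical).
annotation = {
    "id": "3",                                    # incremented by one throughout the article
    "infons": {
        "type": "ExperimentalMethod",             # static for all annotations
        "PSIMI": "MI:0096",                       # PSI-MI id of the interaction detection method
    },
    "text": "GST pull-down assays confirmed the interaction.",  # annotated sentence(s)
    "location": {"offset": 12345, "length": 48},  # position of the passage in the article
}
```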

Figure 2.

A sample annotation from a paragraph of an article in the data set. Each annotation has an identifier that is incremented by one throughout the article. Moreover, the value of the ‘type’ infon is static and set to ‘ExperimentalMethod’ for all annotations. The value of the ‘PSIMI’ infon is set to the PSI-MI identifier of the interaction detection method. The ‘text’ tag holds the annotated sentence(s). The ‘location’ tag holds the position of the annotated portion in the article with the ‘offset’ and ‘length’ attributes.

The articles are annotated by considering 103 interaction detection methods (https://github.com/ferhtaydn/biocemid/blob/odj/files/103_methods.txt) in the PSI-MI ontology (the nodes under ‘MI:0045’ (http://purl.obolibrary.org/obo/MI_0045) which defines ‘experimental interaction detection’). The annotation statistics for the 35 interaction detection methods that are annotated in at least one article in the data set are shown in Table 1. The PSI-MI identifiers of the methods, their canonical names in the PSI-MI ontology, the number of articles each method occurs in, as well as the total number of passages annotated for each method are presented in the table. Fifteen interaction detection methods are annotated in only one article and seven methods are annotated in only one passage. The most common methods at the article-level (i.e. annotated in the highest number of different articles) are ‘pull down’, ‘coimmunoprecipitation’, ‘two hybrid’, ‘anti bait coimmunoprecipitation’ and ‘anti tag coimmunoprecipitation’. The most common methods at the passage-level are ‘two hybrid’, ‘coimmunoprecipitation’, ‘pull down’, ‘nuclear magnetic resonance’, ‘chromatin immunoprecipitation assay’, ‘anti bait coimmunoprecipitation’ and ‘x-ray crystallography’.

Table 1.

List of experimental interaction detection methods which are annotated in at least one article in the manually annotated data set

Id | Name | Articles | Passages
MI:0004 | affinity chromatography technology | 3 | 5
MI:0006 | anti bait coimmunoprecipitation | 8 | 23
MI:0007 | anti tag coimmunoprecipitation | 7 | 14
MI:0014 | adenylate cyclase complementation | 2 | 2
MI:0017 | classical fluorescence spectroscopy | 1 | 4
MI:0018 | two hybrid | 10 | 54
MI:0019 | coimmunoprecipitation | 14 | 49
MI:0029 | cosedimentation through density gradient | 1 | 1
MI:0030 | cross-linking study | 2 | 5
MI:0040 | electron microscopy | 1 | 4
MI:0053 | fluorescence polarization spectroscopy | 1 | 1
MI:0054 | fluorescence-activated cell sorting | 3 | 4
MI:0055 | fluorescent resonance energy transfer | 3 | 10
MI:0065 | isothermal titration calorimetry | 3 | 8
MI:0071 | molecular sieving | 4 | 9
MI:0077 | nuclear magnetic resonance | 4 | 27
MI:0081 | peptide array | 1 | 4
MI:0096 | pull down | 15 | 43
MI:0104 | static light scattering | 1 | 1
MI:0107 | surface plasmon resonance | 2 | 3
MI:0114 | x-ray crystallography | 5 | 21
MI:0276 | blue native page | 1 | 2
MI:0402 | chromatin immunoprecipitation assay | 5 | 24
MI:0411 | enzyme linked immunosorbent assay | 2 | 3
MI:0412 | electrophoretic mobility supershift assay | 1 | 2
MI:0413 | electrophoretic mobility shift assay | 1 | 6
MI:0416 | fluorescence microscopy | 5 | 15
MI:0419 | gtpase assay | 2 | 4
MI:0423 | in-gel kinase assay | 1 | 1
MI:0426 | light microscopy | 1 | 1
MI:0663 | confocal microscopy | 3 | 6
MI:0676 | tandem affinity purification | 1 | 4
MI:0809 | bimolecular fluorescence complementation | 1 | 8
MI:0858 | immunodepleted coimmunoprecipitation | 1 | 1
MI:0889 | acetylase assay | 1 | 1

Methodology

An information retrieval based system for identifying passages that describe an experimental method as evidence for a physical PPI is developed (Biocemid, RRID:SCR_014779). The overall workflow of the system is shown in Figure 3. The system pipeline takes a BioC article as input, processes it, and returns the article with the annotated passages for experimental methods in BioC format as output. The ‘BioC Java library’ (https://sourceforge.net/projects/bioc/files/BioC_Java_1.0.1.tar.gz/download) (BioC Java library, Version: 1.0.1, RRID:SCR_014777) (32) is used to read, modify, and re-create the BioC files.

Figure 3.

Overall system workflow.

In the preprocessing step, a rule-based sentence splitting method, which we developed based on the pattern of a period followed by a space, is used. Paragraphs with infon types such as ‘title’, ‘table caption’, ‘table’, ‘ref’, ‘footnote’ and ‘front’ are excluded from query matching, since their text often does not consist of proper sentences yet may contain keywords relevant to experimental methods, which would increase the number of false positives (FPs). Moreover, even paragraphs tagged with infon types that we do not exclude are not used for query matching if they comprise fewer than five words. We observe that such short paragraphs generally result from incorrect tag assignment during BioC format conversion; for example, a header with the text ‘Pull-down Experiment Results’ may be tagged with the ‘paragraph’ infon instead of ‘title’. The ‘Stanford CoreNLP toolkit’ (http://stanfordnlp.github.io/CoreNLP/index.html and http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip) (Stanford CoreNLP, Version: 3.6.0, RRID:SCR_014778) (39) is used to tokenize the sentences. During tokenization, punctuation marks, braces, parentheses, brackets, digits, floats etc. are removed from the sentences.
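The filtering and splitting rules above can be summarized in a short sketch. This is an illustration under the stated rules rather than the authors' implementation; the function names are ours, and the tokenizer below is a crude stand-in for Stanford CoreNLP.

```python
import re

EXCLUDED_INFONS = {"title", "table caption", "table", "ref", "footnote", "front"}

def keep_paragraph(infon_type: str, text: str) -> bool:
    """Use a paragraph for query matching only if its infon type is not excluded
    and it contains at least five words."""
    return infon_type not in EXCLUDED_INFONS and len(text.split()) >= 5

def split_sentences(text: str) -> list:
    """Rule-based splitting on a period followed by a space."""
    return [s.strip() for s in re.split(r"(?<=\.)\s", text) if s.strip()]

def tokenize(sentence: str) -> list:
    """Lower-case and keep word-like tokens, dropping punctuation, digits and braces."""
    return re.findall(r"[a-z][a-z0-9-]*", sentence.lower())
```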

Three query matching based algorithms are designed to retrieve passages that describe specific experimental methods. All three algorithms share the same main idea that a query is generated for each experimental method and included in the query table. The queries in the query table are used to match against the paragraphs in the input article to annotate the passages with experimental methods. The article is returned from the pipeline either unchanged or annotated for the passages with the matching experimental methods.

Each algorithm is described in detail in the following sections.

Baseline for query matching. The baseline algorithm defines an initial query for each experimental method by using the names of the experimental method in the PSI-MI ontology. For example, the initial queries for the ‘affinity chromatography technology’, ‘two hybrid’ and ‘pull down’ experimental methods are shown in Table 2. Whereas ‘pull down’ has only its name without any synonyms in the ontology, ‘affinity chromatography technology’ and ‘two hybrid’ have more than one synonym. The algorithm uses the terms in the initial queries of the experimental methods to detect relevant passages. Besides establishing the minimum performance level to be improved upon, the baseline algorithm also provides the basis for the other two algorithms, which are constructed by expanding the initial queries.

Table 2.

The initial queries for the ‘affinity chromatography technology’ (MI:0004), ‘two hybrid’ (MI:0018) and ‘pull down’ (MI:0096) experimental methods

MI:0004 | MI:0018 | MI:0096
affinity chromatography technology | two hybrid | pull down
affinity chrom | two-hybrid |
affinity purification | yeast two hybrid |
 | 2 hybrid |
 | 2-hybrid |
 | y2h |
 | classical two hybrid |
 | gal4 transcription regeneration |
 | 2h |

The sentences in the paragraphs are matched against the query table of the initial queries for each experimental method. The initial queries contain terms, which can be word unigrams, bigrams or trigrams. If a sentence contains a term from the initial query of an experimental method, the sentence is annotated with that experimental method. If there are successive sentences with the same annotation, they are concatenated under one annotation tag. As a result, sentences or groups of sentences (passages) in paragraphs are annotated for experimental methods.
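A compact sketch of the baseline matcher is given below. The query table is keyed by PSI-MI id and holds the names and synonyms from Table 2, and successive sentences annotated with the same method are merged into one passage. This is an illustration of the described procedure rather than the authors' code, and it annotates a sentence with at most the first matching method.

```python
QUERY_TABLE = {
    "MI:0018": ["two hybrid", "two-hybrid", "yeast two hybrid", "2 hybrid", "2-hybrid",
                "y2h", "classical two hybrid", "gal4 transcription regeneration", "2h"],
    "MI:0096": ["pull down"],
}

def annotate_baseline(sentences):
    passages = []       # list of (method_id, passage_text)
    prev_method = None  # annotation of the immediately preceding sentence
    for sentence in sentences:
        lowered = sentence.lower()
        matched = next((mid for mid, terms in QUERY_TABLE.items()
                        if any(term in lowered for term in terms)), None)
        if matched is None:
            prev_method = None
            continue
        if matched == prev_method:
            # successive sentences with the same annotation form one passage
            passages[-1] = (matched, passages[-1][1] + " " + sentence)
        else:
            passages.append((matched, sentence))
        prev_method = matched
    return passages
```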

Two algorithms are developed on top of the baseline for expanding the initial query. The first algorithm is a supervised approach and uses a training set of articles annotated for passages with experimental methods. The most salient query terms are selected based on the tf.rf term weighting metric (40). The second algorithm is an unsupervised approach and utilizes a large unlabeled corpus for query expansion based on the word embeddings of the initial query terms.

tf.rf-based query generation. The texts under the ‘annotation’ tags in the paragraphs (see Figure 2) of the manually annotated BioC articles in our data set were used as input for the tf.rf method. These texts were filtered according to each experimental method, split into sentences, and tokenized. The frequency of each token was calculated and token-frequency tuples were prepared. These tuples were used to calculate the weight of each token with the tf.rf method as follows.
tf.rf = tf * log2(2 + a / max(1, c))
(1)
tf is the number of times the token occurs in the passages annotated for the given experimental method (i.e. passages in the positive category), a is the number of passages in the positive category that contain the token, and c is the number of passages in the negative category (i.e. passages annotated with other experimental methods) that contain the token. The intuition behind rf is that a term that occurs more in the positive category compared with the negative category has more discriminating power.
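Equation (1) can be implemented directly; the variable names below follow the definitions above, and the example call is ours.

```python
from math import log2

def tf_rf(tf: int, a: int, c: int) -> float:
    """tf.rf weight of a token for one experimental method (Equation 1).
    tf: occurrences of the token in passages annotated with the method (positive category)
    a : number of positive passages containing the token
    c : number of negative passages (other methods) containing the token"""
    return tf * log2(2 + a / max(1, c))

# Example: a token occurring 12 times in 8 positive and 2 negative passages:
# tf_rf(12, 8, 2) == 12 * log2(6) ≈ 31.0
```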

For each experimental method, the terms are ranked by their tf.rf weights and manually examined to create the first tier and second tier tf.rf term lists, and the initial query of that experimental method is expanded with these lists. The first tier tf.rf list consists of high-scored relevant tf.rf terms, whereas the second tier tf.rf list consists of lower-scored, yet still relevant terms. An example expanded query for the ‘pull down’ experimental method is shown in Table 3. We also investigated selecting the first and second tier term lists automatically. Table 4 shows the expanded query for the ‘pull down’ experimental method generated by selecting the top seven terms based on their tf.rf scores as first tier terms and the next seven terms as second tier terms. Similarly, Table 5 shows the expanded query when the top 10 terms based on their tf.rf scores are selected as first tier terms and the next 10 terms are selected as second tier terms. The names of the experimental methods are excluded from the first and second tier lists even if they have high tf.rf weights, since they are already included in the initial query.

Table 3.

Expanded query for the ‘pull down’ experimental method

Names | Tier 1 Terms | Tier 2 Terms
pull down | pull-down | flag-tagged
 | down | pull
 | pulled | pulled-down
 | gst | gst-fusion
 | his-tagged | glutathione
 | s-transferase | glutathione-sepharose
 | | affinity

The names are extracted from the PSI-MI ontology. The Tier 1 and Tier 2 terms are extracted manually based on tf.rf weights.


Table 4.

Expanded query for the ‘pull down’ experimental method

Names | Tier 1 Terms | Tier 2 Terms
pull down | pull-down | binding
 | gst | gst-hnrnp-k
 | rab5 | recombinant
 | appl1 | interaction
 | down | his-tagged
 | proteins | protein
 | melk | mutations

The names are extracted from the PSI-MI ontology. The Tier 1 and Tier 2 terms are constructed automatically from the top 7 and the next 7 terms ranked by tf.rf weight.


Table 5.

Expanded query for the ‘pull down’ experimental method

Names | Tier 1 Terms | Tier 2 Terms
pull down | pull-down | interaction
 | gst | his-tagged
 | rab5 | protein
 | appl1 | mutations
 | down | pull
 | proteins | used
 | melk | gtp
 | binding | assay
 | gst-hnrnp-k | fusion
 | recombinant | figure

The names are extracted from the PSI-MI ontology. The Tier 1 and Tier 2 terms are constructed automatically from the top 10 and the next 10 terms ranked by tf.rf weight.


The sentences in the paragraphs are matched against the created queries for each experimental method. First, the names list of the expanded query is used. The names list contains terms which can be unigrams, bigrams or trigrams (word-level). The first and second tier lists, on the other hand, consist only of unigrams. A name found in a sentence has a weight of 1.0. The terms in the first and second tier lists are then searched in the sentences. A matching term from the first tier list is assigned a weight of 0.50, whereas a matching term from the second tier list is assigned a weight of 0.25. These weights were set heuristically, without tuning, by giving full weight to a name/synonym in the PSI-MI ontology, half of this weight to a term from the first tier list, and a quarter of this weight to a term from the second tier list. The threshold for selecting a sentence as relevant to an experimental method is set to 1.0. That is, the existence of a name or a synonym of an experimental method in the sentence is enough to annotate the sentence with the corresponding experimental method, but if there is no name or synonym in the sentence, at least one Tier 1 term and two Tier 2 terms, or two Tier 1 terms, or four Tier 2 terms are needed for annotation. The previous and next sentences of the selected sentence are also processed to check whether they are relevant to the same experimental method. If the previous and next sentences of the annotated sentence obtain the highest score for the same experimental method and if this score is ≥0.50 (i.e. they contain at least one Tier 1 term or two Tier 2 terms), they are annotated with the same experimental method. All successive sentences with the same annotation are concatenated under one annotation tag. As a result, sentences or groups of sentences (passages) in paragraphs are annotated for experimental methods.
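The weighting and thresholding scheme above can be sketched as follows. The query layout (names plus Tier 1 and Tier 2 lists) and the example terms are taken from Table 3; the function and constant names are ours, and the neighbour-extension step is only indicated in the comments.

```python
WEIGHTS = {"names": 1.00, "tier1": 0.50, "tier2": 0.25}
SENTENCE_THRESHOLD = 1.00   # score needed to annotate a sentence
NEIGHBOR_THRESHOLD = 0.50   # score needed to extend the annotation to a neighbouring sentence

PULL_DOWN_QUERY = {
    "names": ["pull down"],
    "tier1": ["pull-down", "down", "pulled", "gst", "his-tagged", "s-transferase"],
    "tier2": ["flag-tagged", "pull", "pulled-down", "gst-fusion", "glutathione",
              "glutathione-sepharose", "affinity"],
}

def score_sentence(sentence, query):
    """Sum the weights of all query terms found in the sentence."""
    lowered = sentence.lower()
    return sum(WEIGHTS[tier] * sum(term in lowered for term in terms)
               for tier, terms in query.items())

# A sentence is annotated when score_sentence(...) >= SENTENCE_THRESHOLD; its previous
# and next sentences join the passage when they score >= NEIGHBOR_THRESHOLD for the
# same experimental method.
```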

Word embeddings based query generation. In distributional models, the distributed representations of words are modeled by assuming that word similarity is based on the similarity of observed contexts. In other words, if two words tend to occur in similar contexts, it is likely that they also have similar semantic meanings. The distributed representations of words are generally implemented in continuous vector space models (i.e. word embeddings), where each word is represented as a point in the vector space. The coordinates of the words are determined according to the context items around them. Therefore, similar words are mapped to nearby points (41).

‘Word2vec’ (word2vec, RRID:SCR_014776) is an efficient implementation for unsupervised learning of word embeddings from an unlabeled corpus. It provides two predictive models; the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. The CBOW model predicts target words from source context words, while the Skip-Gram model does the inverse and predicts source context-words from the target words (35).

In this study, we used ‘word2vec’ to expand the initial experimental method queries consisting of the PSI-MI ontology terms by using word embeddings learned from a large unlabeled biomedical corpus. A set of 691,558 full text articles from the ‘PMC Open Access’ (http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) database (PubMed Central, RRID:SCR_004166) is used as unlabeled data. All the articles are passed through a preprocessing pipeline. The Stanford CoreNLP tool is used for lower-casing, tokenization and sentence splitting. Then, punctuation marks are removed using a manually prepared list of punctuation marks. Numeric and non-ASCII characters are also removed (Data Cleaning Code of this Study https://github.com/ferhtaydn/stopword_remover/tree/odj). After the preprocessing steps, all 691,558 articles are merged into a single text file to be used as input for ‘word2vec’.

Since experimental method names generally consist of multiple words, vectors for words as well as phrases, which consist of up to four words (bigrams, trigrams and fourgrams), are required. The phrases are obtained by running ‘word2phrase’ (http://word2vec.googlecode.com/svn/trunk/word2phrase.c) (word2vec, Version: Revision 42, RRID:SCR_014776), which uses bigram statistics to form phrases, twice (i.e. consecutively) on the preprocessed unlabeled data. The minimum word occurrence count is set as 5, and the threshold parameter is set as 200 and 100 for the first and second runs, respectively. The phrases are treated as individual tokens like words during training. The resulting data set contains 2,241,223,681 total and 3,229,270 unique tokens. This data set is given as unlabeled input data to the ‘word2vec’ tool, which is run with the ‘Hierarchical Softmax’ (42, 43) based training algorithm and the ‘Skip-Gram’ (35) architecture with the suggested default settings for the parameters. Context window size, sub-sampling rate and training iteration count are set as 10, 1e-4 and 15, respectively. Minimum word occurrence count is set as 10; in other words, words appearing <10 times are removed. As a result, word vectors of size 200 are generated.
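The authors ran the original C implementations of ‘word2phrase’ and ‘word2vec’; the sketch below reproduces roughly the same configuration with the gensim library (our assumption, not the tooling used in the study), with the parameter values reported above. The two phrase-detection passes stand in for the two consecutive ‘word2phrase’ runs.

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

def train_embeddings(sentences):
    """sentences: iterable of token lists built from the preprocessed PMC corpus."""
    first = Phraser(Phrases(sentences, min_count=5, threshold=200))           # first word2phrase pass
    second = Phraser(Phrases(first[sentences], min_count=5, threshold=100))   # second pass
    phrased = second[first[sentences]]
    return Word2Vec(
        phrased,
        vector_size=200,         # size of the word vectors
        sg=1, hs=1, negative=0,  # Skip-Gram with Hierarchical Softmax
        window=10,               # context window size
        sample=1e-4,             # sub-sampling rate
        min_count=10,            # drop words appearing fewer than 10 times
        epochs=15,               # training iterations
    )
```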

The word vectors of an experimental method’s names (and synonyms) in the PSI-MI ontology are used to expand the initial query for the experimental method. The word vectors for some terms of the initial queries could not be constructed by ‘word2vec’ because of insufficient data for those terms in the unlabeled input data set. For example, for the initial query of the ‘two hybrid’ (MI:0018) experimental method consisting of the terms ‘two hybrid’, ‘two-hybrid’, ‘yeast two hybrid’, ‘2 hybrid’, ‘2-hybrid’, ‘y2h’, ‘classical two hybrid’, ‘gal4 transcription regeneration’ and ‘2h’, only the vectors of the ‘2-hybrid’, ‘two-hybrid’, ‘y2h’ and ‘2h’ terms were constructed. For each term (which has a word vector) in the initial query, the top 100 terms whose word vectors are most similar (in terms of cosine similarity) to the word vector of the initial query term are retrieved by using a modified version of the ‘distance’ (https://github.com/ferhtaydn/word2vec_extension/blob/odj/distance_files.c) component of ‘word2vec’. Then, for each term in the initial queries, the top 100 similar terms are manually analyzed. It is observed that for ambiguous initial query terms, such as the ‘2h’ term of the ‘two hybrid’ experimental method, non-relevant terms with high cosine similarity scores are retrieved. Therefore, such ambiguous terms are removed from the initial queries of the corresponding experimental methods. The removed terms are (MI:0016, cd), (MI:0018, 2h), (MI:0053, fps), (MI:0055, ret), (MI:0099, spa), (MI:0104, sly), (MI:0112, myth), (MI:0114, x-ray), (MI:0226, ice), (MI:0419, gtpase), (MI:0428, microscopy), (MI:0437, trihybrid), (MI:0676, tap), (MI:0728, kiss) and (MI:0825, x-ray).
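Retrieving the expansion candidates amounts to a nearest-neighbour query over the learned vectors. The sketch below uses gensim's KeyedVectors interface rather than the modified ‘distance’ tool, so the token form of multi-word queries (e.g. underscore-joined phrases) depends on how the phrase model writes them.

```python
def neighbours(vectors, term, topn=100):
    """Top-n most similar vocabulary items for one initial query term, if it has a vector.
    vectors: a gensim KeyedVectors instance (e.g. model.wv)."""
    if term not in vectors:      # e.g. 'classical two hybrid' had no vector
        return []
    return vectors.most_similar(term, topn=topn)   # list of (term, cosine similarity)
```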

For each term in the initial query of an experimental method, the most similar 100 terms are obtained. If the initial query contains k terms, then a pool of k*100 terms is obtained for the corresponding experimental method. A list of terms for each experimental method is created from its pool of terms by removing: (i) duplicate terms, keeping the occurrence with the highest cosine similarity score; (ii) terms that are already listed as a name or synonym of an experimental method in the PSI-MI ontology (see the combine function in Figure 4); (iii) terms which contain a name from the initial query of another experimental method, e.g. ‘gst-pull-down-assay’ and ‘pull’ are removed from the ‘two hybrid’ results; (iv) terms which have a higher cosine score in the list of another experimental method, e.g. ‘coprecipitated, 0.82’ is removed from the results of ‘coimmunoprecipitation’ (MI:0019), since it is already in the results of the ‘pull down’ experimental method with a score of 0.83 (see the clean function in Figure 5). Terms that have a substring with a higher score already in the list are also removed (see the filter function in Figure 6). The remaining terms in the list of each experimental method are included in its initial query as expansion terms together with their cosine similarity scores. The cosine similarity scores of the initial query terms are set as 1.0. This procedure is summarized in Figure 7.
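The pooling and cleanup steps (i)-(iv), together with the substring filter, can be sketched as below (cf. Figures 4-7). This follows the textual description rather than the authors' code; the exact rule order and the underscore handling for multi-word tokens are our assumptions.

```python
def build_expanded_query(method_id, pools, ontology_names):
    """pools: {method id: [(term, cosine score), ...]} pooled word2vec neighbours.
    ontology_names: {method id: [names and synonyms in the PSI-MI ontology]}."""
    normalize = lambda t: t.replace("_", " ")      # word2phrase joins phrases with '_'
    # (i) duplicates: keep the highest cosine score for each term
    best = {}
    for term, score in pools[method_id]:
        best[term] = max(score, best.get(term, 0.0))
    # (ii) drop terms already listed as a name/synonym of any experimental method
    all_names = {n for names in ontology_names.values() for n in names}
    best = {t: s for t, s in best.items() if normalize(t) not in all_names}
    # (iii) drop terms containing a name from another method's initial query
    other_names = [n for mid, names in ontology_names.items() if mid != method_id for n in names]
    best = {t: s for t, s in best.items() if not any(n in normalize(t) for n in other_names)}
    # (iv) drop terms that score higher in the pool of another experimental method
    for mid, pool in pools.items():
        if mid != method_id:
            for term, score in pool:
                if term in best and score > best[term]:
                    del best[term]
    # filter: drop terms that contain a shorter term from the list with a higher score
    kept = {t: s for t, s in best.items()
            if not any(sub != t and sub in t and best[sub] > s for sub in best)}
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
```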

Figure 4.

The algorithm for combining the word2vec results of each experimental method into one list.

Figure 5.

The algorithm for cleaning the given list of an experimental method.

Figure 6.

The algorithm for filtering the longer terms with lower scores from the given list of an experimental method.

Figure 7.

The query expansion algorithm of the word2vec approach.

As an example consider the ‘two hybrid’ experimental method. The initial query terms that have word vectors for this method are ‘2-hybrid’, ‘two-hybrid’ and ‘y2h’. The term ‘yeast two-hybrid assays’ is among the most similar 100 terms for each of these three initial query terms, where the corresponding cosine similarity scores are 0.773, 0.838 and 0.826, respectively. Therefore, the term is included into the list of the ‘two hybrid’ method with the cosine similarity score 0.838. The other two occurrences of the term are eliminated. The term ‘gst pull down’ is eliminated while cleaning the list of the ‘two hybrid’ method, since it contains the term ‘pull down’ which is a name of the pull down experimental method (MI:0096). Likewise, the term ‘bifc’ is not included into the list, since it is listed as a synonym of the experimental method ‘bimolecular fluorescence complementation’ (MI:0809) in the PSI-MI ontology. Moreover, the term ‘tap-tag’ with cosine score 0.739 is eliminated, since it is listed in the list of ‘tandem affinity purification’ (MI:0676) experimental method with cosine score 0.804. The term ‘yeast two-hybrid screening’ with cosine score 0.853 is eliminated since ‘two-hybrid screening’ is a substring of ‘yeast two-hybrid screening’, and its score is higher (0.860). The final expanded query for the ‘two hybrid’ experimental method is shown in Table 6. As a result, for each experimental method, an expanded query with a different size is obtained.

Table 6.

The expanded query terms of ‘two hybrid’ (MI:0018) are shown in bold

Terms | Scores
two hybrid | 1.0
two-hybrid | 1.0
yeast two hybrid | 1.0
2 hybrid | 1.0
2-hybrid | 1.0
y2h | 1.0
classical two hybrid | 1.0
gal4 transcription regeneration | 1.0
yeast two-hybrid | 0.91416
two-hybrid system | 0.874584
y2h system | 0.869421
two-hybrid experiments | 0.865643
yeast-two-hybrid | 0.865173
two-hybrid analysis | 0.851562
y2h assay | 0.845296
yeast-two hybrid | 0.844966
two-hybrid assays | 0.844023
y2h screens | 0.843109
two-hybrid assay | 0.842222
y2h assays | 0.84187
yeast two-hybrid assays | 0.837754
yeast two-hybrid y2h | 0.832441
y2h experiments | 0.830541
yeast two-hybrid system | 0.826506
two-hybrid screening | 0.825039
yeast two-hybrid assay | 0.824025
yeast-2-hybrid | 0.823851
yeast 2-hybrid | 0.820577
y2h screen | 0.81859
two-hybrid screens | 0.810895
y2h screening | 0.807603
y2h interactions | 0.806635
yeast two-hybrid screens | 0.794152
two-hybrid interaction | 0.791841
y2h interaction | 0.787684
ap-ms | 0.78406
two hybrid y2h | 0.780946
high-throughput yeast two-hybrid | 0.778971
y2h-based | 0.775805
two-hybrid screen | 0.769459
yeast-two-hybrid y2h | 0.766024
large-scale yeast two-hybrid | 0.76319
yeast two-hybrid screening | 0.758309
PPI | 0.756688
split-ubiquitin | 0.7553
yeast two-hybrid y2h assays | 0.75367
yeast two-hybrid y2h screens | 0.749456
bait-prey | 0.748784
yeast 2-hybrid assays | 0.745503
yeast two-hybrid y2h assay | 0.742103
large-scale y2h | 0.740383
yeast-two-hybrid experiments | 0.739971
two-hybrid interactions | 0.739497
interactors | 0.737921
yeast two-hybrid screenings | 0.737534
interacting proteins | 0.73329
yeast two-hybrid screen | 0.730519
tap-ms | 0.728466
literature-curated interactions | 0.725529
cty10-5d | 0.725047
yll049wp | 0.715317
mammalian two-hybrid | 0.713345
y2h bait | 0.711965
lexa-based | 0.710877
ap/ms | 0.708912
lexa fusions | 0.708365
bait prey | 0.707578
yeast-two-hybrid assay | 0.707011
interaction partners | 0.706699
yeast-two-hybrid system | 0.706006
jnm1p | 0.705212
bait | 0.704263
y3h | 0.703098
bait plasmid | 0.700969
putative interactors | 0.700491
matchmaker gold | 0.698625
mating-based | 0.696109
bait construct | 0.695515
co-ap | 0.695304
yeast co-transformation | 0.695194
mbsus | 0.694528
interacting partners | 0.694077
protein-interaction | 0.692748
PPIs | 0.692531
yeast strain y190 | 0.692009
matchmaker gal4 | 0.691998
ht-y2h | 0.6908
domain gal4-ad | 0.689826
yeast-two-hybrid assays | 0.689708
bait/prey | 0.689438
large-scale ap-ms | 0.68708
y2h library screening | 0.686014
biogrid | 0.685742
prey-prey | 0.685525
two-hybrid library | 0.683535
high-throughput y2h | 0.679981
yeast-two-hybrid screen | 0.678596
yth assays | 0.678315
y2h screenings | 0.678036
high-confidence interactions | 0.677115
yeast-two hybrid assays | 0.655188
glutathione s-transferase gst pulldown | 0.654585
gst-pulldown assay | 0.646435
yeast three-hybrid assay | 0.636332
gst-pulldown experiments | 0.635499
yeast two-hybrid co-immunoprecipitation | 0.635368
vitro pull-down experiments | 0.634054
gst-pull-down assay | 0.63086
gst-pull-down | 0.628426
unc-89 ig1 | 0.62706
bait constructs | 0.626819
gst-pull-down assays | 0.626504
l40 yeast | 0.625285
yeast strain mav203 | 0.624612
glutathione s-transferase gst tagged | 0.624112
gst pulldowns | 0.623283
y3h assays | 0.622409
lexa-esc4 | 0.618557
eif2bϵ | 0.618421

The terms in italics were eliminated from the word2vec results by the cleaning operation explained in Figure 5. The terms with score 1.0 are the initial query items (names or synonyms). Terms that contain (after splitting on whitespace) any name or synonym are also eliminated and are therefore shown in italics.


The generated expanded queries are used to identify the passages describing experimental methods in full text articles. Given a full text article in BioC format, the sentences in the paragraphs are matched against the expanded queries of each experimental method. The threshold for selecting a sentence as relevant to an experimental method is set to a certain value, e.g. 0.9 (for the word embeddings based approach) and 1.0 (for the baseline and tf.rf based approaches). If the query score for a sentence is greater than or equal to that threshold, the sentence is annotated with the experimental method for which it scored highest. The previous and next sentences of the selected sentence are also processed to check whether they are relevant to the same experimental method or not. If the previous and next sentences of the annotated sentence obtain the highest score for the same experimental method and if this score is greater than or equal to a certain value, e.g. 0.5 (for the baseline and tf.rf based approaches) and 0.65 (for the word embeddings based approach), they are annotated with the same experimental method. All the successive sentences with the same annotation are concatenated under one annotation tag. As a result, sentences or groups of sentences (passages) in paragraphs are annotated for experimental methods.
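The sentence-annotation step shared by all three approaches can be summarized in one parameterized sketch. The thresholds are those quoted above (1.0/0.5 for the baseline and tf.rf approaches, 0.9/0.65 for the word embeddings approach), the expanded query format follows Table 6, and we assume that a sentence's query score is the sum of the weights of its matched terms; this is an illustrative reconstruction, not the authors' implementation.

```python
def best_method(sentence, expanded_queries):
    """expanded_queries: {method id: [(term, weight), ...]} as in Table 6."""
    lowered = sentence.lower()
    scores = {mid: sum(w for term, w in terms if term.lower() in lowered)
              for mid, terms in expanded_queries.items()}
    return max(scores.items(), key=lambda kv: kv[1])    # (method id, score)

def annotate(sentences, expanded_queries, sentence_thr=0.9, neighbour_thr=0.65):
    annotations = [None] * len(sentences)               # method id (or None) per sentence
    for i, sentence in enumerate(sentences):
        mid, score = best_method(sentence, expanded_queries)
        if score >= sentence_thr:
            annotations[i] = mid
            for j in (i - 1, i + 1):                    # check the neighbouring sentences
                if 0 <= j < len(sentences) and annotations[j] is None:
                    n_mid, n_score = best_method(sentences[j], expanded_queries)
                    if n_mid == mid and n_score >= neighbour_thr:
                        annotations[j] = mid
    return annotations   # runs of equal ids form one annotated passage
```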

Evaluation

Jaccard index for passage similarity

The Jaccard index (44) is a statistic that measures the similarity of finite sets as the ratio of the size of their intersection to the size of their union, as shown in Equation (2). In contrast, the Jaccard distance, which measures the dissimilarity of sets, is obtained by subtracting the Jaccard coefficient from 1, as shown in Equation (3).
J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
(2)
d_J(A,B) = 1 - J(A,B)
(3)

In our case, we use the Jaccard index to calculate the similarity of passages. Each passage can be thought of as an ordered sequence of characters (a string). When the Jaccard similarity of two passages is measured, the character length of each passage (|A| and |B|) and the character length of the intersection of the passages (|A ∩ B|) are calculated. The intersection of two passages falls into one of the following cases: (i) one passage covers the other, so the shorter passage is the intersection; (ii) the two passages are identical, so either of them is the intersection; (iii) if neither case (i) nor case (ii) applies, the longest common substring of the two passages is taken as the intersection. After the length of each passage and the length of their intersection are calculated, the Jaccard similarity of the two passages is computed according to Equation (2). The Jaccard index lies in [0, 1]: it is 1.0 for two identical passages and 0.0 for completely different passages.
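
A minimal sketch of this passage-level Jaccard computation is given below (Python, illustrative only), assuming that the intersection of two non-identical, non-nested passages is their longest common substring as described above; the quadratic dynamic-programming routine is for clarity, not efficiency.

def longest_common_substring_len(a, b):
    """Character length of the longest common substring of a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def passage_jaccard(a, b):
    if a == b:
        return 1.0                                   # case (ii): identical passages
    if a in b or b in a:
        inter = min(len(a), len(b))                  # case (i): one passage covers the other
    else:
        inter = longest_common_substring_len(a, b)   # case (iii)
    union = len(a) + len(b) - inter                  # |A| + |B| - |A ∩ B|, Equation (2)
    return inter / union if union else 1.0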

Evaluation measures

The performance of the system is evaluated by comparing the output articles of the system against the manually annotated versions. Each output article of the system is compared with its manually annotated version at the paragraph level. Two annotations must have the same experimental method ID and (fully or partially) common text to be evaluated as matched annotations. Otherwise, the annotations are assessed as non-matched. The possible matched and non-matched cases are listed below:

  • If a paragraph contains annotated passages in the manually annotated article, but the corresponding paragraph in the system output article does not have any annotated passages, this corresponds to the case of false negative (FN). The Jaccard distance for each annotated passage in the manually annotated paragraph is 1.0. The sum of those Jaccard distances is added to the total FN score of the evaluation.

  • If a paragraph does not have any annotated passages in the manually annotated article, but the corresponding paragraph in the system output article has annotated passages, this corresponds to the case of FP. The Jaccard distance for each annotated passage in the system output paragraph is 1.0. The sum of those Jaccard distances is added to the total FP score of the evaluation.

  • If a paragraph has annotated passages both in the manually annotated article and the system output article, but some of these annotations are non-matched (i.e. either the experimental method ID does not match or the annotated passages do not have any overlapping text), then these correspond to the cases of FN and/or FP. For each non-matched passage annotation in the manually annotated article, the FN score is updated as in the first case above. At the same time, for each non-matched passage annotation in the system output article, the FP score is updated as in the second case above.

  • If a paragraph has matched annotated passages both in the manually annotated article and the system output article, this corresponds to the case of true positive (TP). In this case, the Jaccard indexes of those passage pairs are added to the total TP score of the evaluation. In case of exact match, the Jaccard index for a passage pair is 1.0. However, if a passage pair matches partially in terms of common text, after its Jaccard index is added to the TP, we calculate the ‘Partial Jaccard Distance’ (our adaptation of Jaccard distance to this problem) for the unmatched parts of the passages as shown in Equations (6) and (7). The text of the manually annotated passage and the text of the system output passage are represented with M and S, respectively. The unmatched text portion can be part of either the manually annotated passage or system annotated passage. If it is part of the manually annotated passage, the partial Jaccard distance of the manually annotated passage (M) (Equation 6) is added to the FN score in the evaluation. Otherwise, if it is part of the system annotated passage, the partial Jaccard distance of the system annotated passage (S) (Equation 7) is added to the FP score. The partial Jaccard distance of a target passage from another passage is calculated by subtracting the Jaccard index of the passage pair from the normalized length of the target passage. The normalized length of the target passage of a passage pair can be calculated by taking the ratio of the character length of the target passage and the character length of the union of the passage pair as shown in Equations (4) and (5).
    N_M(M,S) = \frac{|M|}{|M \cup S|}
    (4)
    N_S(M,S) = \frac{|S|}{|M \cup S|}
    (5)
    pd_{J_M}(M,S) = N_M(M,S) - J(M,S)
    (6)
    pd_{J_S}(M,S) = N_S(M,S) - J(M,S)
    (7)

The following example, which is shown in Figure 8, covers the different cases mentioned above for the evaluation logic of the annotated passages. The paragraph in Figure 8 is taken from (45). This article is also in our published manually annotated data set (https://github.com/ferhtaydn/biocemid/blob/odj/files/published_dataset/16513846.xml).

Figure 8.

An example paragraph which shows our evaluation logic over three sample manual and system annotations. The manually annotated passages are underlined with red and green for ‘bimolecular fluorescence complementation’ (MI:0809) and ‘two hybrid’ (MI:0018) experimental methods, respectively. The annotated passages by the system are colored with blue and purple for ‘bimolecular fluorescence complementation’ (MI:0809) and ‘two hybrid’ (MI:0018) experimental methods, respectively.

In the first annotation, the system identified the manually annotated passage fully. However, it incorrectly included the first sentence in the passage as well, which resulted in a FP text portion. The manual annotation text length is 371 characters, whereas the system annotation text length is 523 characters. The union length is 523 and the matching (common) text length is 371. Since the Jaccard index is 371/523 = 0.709, TP is increased by 0.709. FN is not changed (the manual annotation is fully covered). The partial Jaccard distance is calculated as (523 - 371)/523 = 0.291 and FP is increased by 0.291. In the second annotation, the system did not annotate any additional incorrect sentences. However, it was not able to identify the manually annotated passage fully, but identified only a portion of it. The manual annotation text length is 452 characters and the system annotation text length is 258 characters. The union length is 452 and the matching text length is 258 characters. Since the Jaccard index is 258/452 = 0.571, TP is increased by 0.571. FP is not changed. The partial Jaccard distance is calculated as (452 - 258)/452 = 0.429 and FN is increased by 0.429. The last annotation is an exact match: the manual and system annotations are the same. Thus, the Jaccard index is 1.0 and TP is increased by 1.0. FN and FP are not changed.
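
The per-pair bookkeeping for matched annotations can be sketched as follows (Python, illustrative only), using the definitions of Equations (2) and (4)–(7); lcs_len() is assumed to return the character length of the common text of the two passages, as in the passage similarity sketch above. The comment at the end reproduces the numbers of the first annotation in Figure 8.

def update_scores_for_match(manual, system, lcs_len, scores):
    """manual, system: texts of a matched passage pair; scores: dict with TP/FP/FN."""
    inter = lcs_len(manual, system)
    union = len(manual) + len(system) - inter
    j = inter / union                       # Jaccard index, Equation (2)
    scores["TP"] += j                       # matched (fully or partially)
    if inter < len(manual):                 # part of the manual passage is unmatched
        n_m = len(manual) / union           # Equation (4)
        scores["FN"] += n_m - j             # partial Jaccard distance, Equation (6)
    if inter < len(system):                 # part of the system passage is unmatched
        n_s = len(system) / union           # Equation (5)
        scores["FP"] += n_s - j             # partial Jaccard distance, Equation (7)

# First annotation of Figure 8 (|M| = 371, |S| = 523, common text 371):
# j = 371/523 = 0.709 is added to TP, and (523/523) - 0.709 = 0.291 is added to FP.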

After the calculation of the TP, FP and FN scores for all passages in the articles, Recall, Precision and F-measure are calculated as shown in Equations (8)–(10), respectively.
Recall = \frac{TP}{TP + FN}
(8)
Precision = \frac{TP}{TP + FP}
(9)
F\text{-measure} = \frac{2 \times Precision \times Recall}{Precision + Recall}
(10)

Results

The developed methods (‘baseline’, ‘tf.rf’ and ‘word.embeddings’) are evaluated under different configurations on the test set, which comprises 17 full text articles, and their performances are compared with each other as shown in Table 7. The baseline approach does not need a training or validation set, since the existence of a name or a synonym of an experimental method determines the result of the annotation for that sentence; there is no training or parameter tuning phase. On the other hand, the tf.rf based approach is supervised and needs a training set to extract the Tier 1 and Tier 2 terms used to expand the initial queries. The training set of 13 articles is used for that purpose in the tf.rf method. As in the baseline, the threshold value, which determines whether a sentence should be annotated with an experimental method, is determined heuristically (without tuning) as explained in the ‘Baseline for query matching’ and ‘tf.rf-based query generation’ sections. Since the ‘word.embeddings’ based method is unsupervised, there is no need for a training set, so the 13 articles are used as a validation set to determine the threshold values for the main target sentence and for the previous and next sentences around the target sentence.

Table 7.

Performances of the methods on the test set

Method                      Precision    Recall    F-measure
baseline                    0.424        0.418     0.421
baseline.genia.ino          0.484        0.413     0.446
tf.rf.f7s7                  0.120        0.508     0.194
tf.rf.f7s7.genia.ino        0.133        0.502     0.211
tf.rf.f10s10                0.068        0.512     0.119
tf.rf.f10s10.genia.ino      0.074        0.507     0.129
tf.rf.manual                0.315        0.508     0.389
tf.rf.manual.genia.ino      0.357        0.503     0.418
word2vec                    0.321        0.618     0.422
word2vec.genia.ino          0.362        0.606     0.453

In the ‘tf.rf’-based approach, we experimented with three different configurations as explained in the ‘tf.rf-based query generation’ section. The manual selection of the tf.rf terms is labeled as ‘tf.rf.manual’, and the automatic selections of the tf.rf terms are labeled as ‘tf.rf.f7s7’ and ‘tf.rf.f10s10’ in the results table. ‘tf.rf.f7s7’ corresponds to the configuration in which the first 7 and the second 7 terms are included in the Tier 1 and Tier 2 lists, respectively. Similarly, ‘tf.rf.f10s10’ corresponds to the configuration in which the first 10 and the second 10 terms are included in the Tier 1 and Tier 2 lists, respectively. The results in Table 7 show that all three tf.rf configurations obtain similar recall levels, which are higher than the recall of the baseline. The precision values of the tf.rf configurations with automatically selected terms are much lower than that of the tf.rf configuration with manually selected terms, all of which are lower than the precision of the baseline. The results also show that including more automatically selected terms in the tf.rf approach leads to only a slight increase in recall, but results in a drastic decrease in precision and F-measure.

In the ‘word.embeddings’ (word2vec) based approach, the threshold for detecting the main target sentences is set to 0.9 and the threshold for detecting the sentences before and after the main target sentences is set to 0.65. These thresholds were determined by running the system on the validation set of 13 articles with a range of different threshold values and selecting the setting with the highest F-measure. While improving the recall and F-measure scores (i.e. improving the FN and TP scores), we also try to improve (i.e. decrease) the FP score by applying an extra constraint in the annotation decision step, which requires a passage to contain at least one protein name and one interaction keyword. Passages that explain the details of how an experiment was conducted (i.e. the experiment layout) without giving any information or evidence about the interacting proteins are prevented from being annotated by this constraint. Likewise, passages that only include some experimental method related keywords, but do not give any detail about the proteins or the interaction, are eliminated. Protein name detection in the passages is done with the GENIA Tagger (http://www.nactem.ac.uk/tsujii/GENIA/tagger/) (GENIA Project: Mining literature for knowledge in molecular biology, Version: 3.0.2, RRID:SCR_007990) (46). To detect the interaction keyword(s) in the passages, the Interaction Network Ontology (INO) (http://www.ino-ontology.org/, http://bioportal.bioontology.org/ontologies/INO) (Interaction Network Ontology, Version: 1.0.95, RRID:SCR_010347) (47, 48) is used. The terms under the ‘INO_0000006’ annotation (http://purl.obolibrary.org/obo/INO_0000006), defining the literature mining keywords, are extracted and the protein related interaction keywords are filtered manually (https://github.com/ferhtaydn/biocemid/blob/odj/files/ino/literature_mining_keywords_related_to_proteins.txt). As illustrated in Table 7, this constraint results in a considerable improvement in precision and F-measure for all methods (represented with the ‘genia.ino’ extension) with a relatively smaller decrease in recall.
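
The constraint can be sketched as follows (Python, illustrative only): contains_protein() stands for a check over the GENIA Tagger output and is a placeholder, and the keyword file is the manually filtered INO keyword list published in our repository; the loading code itself is an assumption, not the actual implementation.

def load_interaction_keywords(path="literature_mining_keywords_related_to_proteins.txt"):
    """Load the manually filtered INO interaction keywords (one per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def passes_constraint(passage, contains_protein, interaction_keywords):
    """Keep a candidate passage only if it mentions a protein and an interaction keyword."""
    has_protein = contains_protein(passage)            # e.g. a named entity from the GENIA Tagger
    text = passage.lower()
    has_keyword = any(kw in text for kw in interaction_keywords)
    return has_protein and has_keyword                  # otherwise the passage is not annotated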

The goal of the BioC Task was to develop a text-mining system to assist biocurators. It has been shown that high recall is more desirable than high precision for a system aiming to assist biocurators in exhaustive curation (14). It is relatively easier for a biocurator to filter out the FPs retrieved by the system than to manually find the FNs missed by the system. Therefore, we aimed at increasing the recall of the baseline method by query expansion without sacrificing much precision. To the best of our knowledge, there is no previous study on identifying passages describing PPI detection methods, so there are no previously reported state-of-the-art scores for this task. Therefore, we compare the proposed approaches with the performance of the baseline approach. As shown in the results table, the word embeddings based methods and the tf.rf based methods achieve higher recall scores than the baseline method at the expense of lower precision. In addition to higher recall, the word embeddings based approach achieves slightly higher F-measure than the baseline. It also outperforms the tf.rf based methods in terms of precision, recall and F-measure, which shows that it is a promising approach for detecting passages with experimental methods.

Discussion

For a query matching based information retrieval system, it is crucial that queries are expanded with highly relevant terms to increase the overall performance of the system. The amount of data used to learn the relevant terms for query expansion generally affects the quality of the final expanded query. Supervised learning techniques need labeled data, and manual data labeling is a laborious and time consuming process. Unsupervised learning techniques, on the other hand, use unlabeled data. The published articles in the biomedical domain constitute a large unlabeled data set that can be used for unsupervised learning. Therefore, compared with supervised techniques, unsupervised techniques can help to scan a larger set of articles in the literature for more experimental methods and to extract valuable information such as repeating patterns, sentence rules, experiment specific terms and proteins in the experimental method related passages. Nevertheless, if an experimental method is not mentioned frequently enough in the literature, extracting information related to that experimental method with unsupervised techniques does not give encouraging results. For example, running ‘word2vec’ on 691,558 articles (2,241,223,681 words) produced word embeddings for the names of only 58 out of 103 experimental methods (the list of experimental methods which have word embeddings in this study is available at https://github.com/ferhtaydn/biocemid/tree/odj/files/oa_word2vecs_pure_baseline). For infrequently used (uncommon) experimental methods, manually curated rules or queries created by domain experts can be used to enhance the retrieval results by expanding such rare experimental methods’ queries with high quality terms.

There are some hard cases that need to be considered for improving the results. Experimental methods that are siblings or close to each other in the PSI-MI ontology have very similar names or synonyms, definitions and experimental details. For example, ‘anti bait coimmunoprecipitation’ (MI:0006) and ‘anti tag coimmunoprecipitation’ (MI:0007) are siblings, and ‘coimmunoprecipitation’ (MI:0019) is their parent. After the queries are expanded with unsupervised techniques like the word embeddings based approach, specific patterns or rules may be defined to distinguish such experimental methods from each other. The word embeddings based approach cannot expand the queries of the ‘anti bait coimmunoprecipitation’ and ‘anti tag coimmunoprecipitation’ methods, since their names are not used directly as method names in the text. Therefore, the system annotates a passage with ‘coimmunoprecipitation’, even if it is related to ‘anti bait coimmunoprecipitation’ or ‘anti tag coimmunoprecipitation’. When labeling with the parent experimental method is considered correct, i.e. when the annotations that have the values 0006 and 0007 in the ‘PSIMI’ infon are changed to 0019 in the manually annotated articles, the scores of the ‘word2vec.genia.ino’ configuration increase as shown in Table 8 (‘word2vec.genia.ino.coip’). Moreover, when the system is evaluated in an experimental method ID agnostic way (i.e. the ID of the experimental method (the PSIMI infon) is not considered as long as the retrieved passage describes an experimental method), the scores increase considerably as shown in Table 8 (‘word2vec.genia.ino.psimi’).

Table 8.

Additional results on the test set

Method                      Precision    Recall    F-measure
word2vec.genia.ino          0.362        0.606     0.453
word2vec.genia.ino.coip     0.390        0.645     0.486
word2vec.genia.ino.psimi    0.439        0.751     0.554

The system evaluation when the ‘anti bait coimmunoprecipitation’ and ‘anti tag coimmunoprecipitation’ methods are regarded as ‘coimmunoprecipitation’ is shown with ‘word2vec.genia.ino.coip’. The system evaluation without the requirement of experimental method ID matching is shown with ‘word2vec.genia.ino.psimi’.


Ambiguous terms also require special focus. For example, the term ‘cross-linking’, which is a name of an experimental method for detecting physical PPIs (MI:0030), is also frequently used in articles to identify interactions between genes. As future work, we plan to investigate these edge cases and improve the performance of our system by better discriminating ambiguous terms and closely related experimental methods.

Another source of error is passages that describe the experimental layout (i.e. describe how an experiment was performed) without providing explicit evidence for a PPI. In our data set, we did not manually annotate such passages. However, since they may contain experimental method names or related terms, our system can annotate such parts of the articles, which causes an increase in the FP rate. The performance of the system can be improved by defining layout patterns or adding constraints to the algorithm to filter out such cases.

Conclusion

In this work, we defined the problem of extracting passages with experimental interaction detection methods used to determine physical interactions between proteins from full text articles in the biomedical domain as an information retrieval search task. In order to extract passages describing experimental methods as evidence for physical PPIs, an initial query for each experimental method is prepared by utilizing the names and synonyms of that experimental method from the PSI-MI ontology. The baseline approach is based on matching those initial queries to the sentences in the paragraphs of articles. To improve the performance of a query matching based approach, the existing queries can be expanded with more relevant terms. Therefore, we proposed two new approaches built on the baseline approach. The first method is supervised and based on expanding the initial queries with the terms determined with the tf.rf metric on manually annotated passages. Our second approach is unsupervised and based on expanding the queries with terms determined using the word embeddings of the terms of the initial queries. In addition, we annotated and made publicly available a data set of 30 full text articles labeled for passages describing experimental methods used to detect physical PPIs.

We applied the tf.rf term weighting metric to passage level classification (instead of its original application to article level text classification). We started from the idea that determining important and discriminating terms for each experimental method, by assigning appropriate weights, should help with extracting passages and classifying them according to the experimental methods they describe. The tf.rf based methods achieve higher recall scores than the baseline method at the expense of lower precision and F-measure. We also applied distributional semantic models to the passage extraction problem in the biomedical domain, by using the word2vec tool to generate word embeddings of terms specific to each experimental method defined in the ontology and expanding the queries with them. The word embeddings based approach achieves considerably higher recall than the baseline and slightly better F-measure. In addition, the word embeddings based approach outperforms the tf.rf based approach in terms of precision, recall and F-measure, which shows that utilizing the huge and growing biomedical literature within an unsupervised learning setting is an effective approach. We also showed that empowering the proposed approaches with extra constraints or features such as terms from domain specific ontologies (i.e. interaction keywords from INO) and specific named entities (i.e. protein names in the sentences, determined with the GENIA Tagger) leads to improved performance.

The most challenging part of this study was gathering enough labeled data. Manually annotating passages for multiple classes of experimental methods in full text, domain specific articles is a difficult and time consuming task. Labeled data are needed for training, validation and test (gold standard) sets in order to develop, validate and test the proposed approaches. The lack of sufficient labeled training data for supervised approaches, such as our tf.rf based approach, limits the learning ability of the algorithms.

An automatic passage (context) extraction system could be a valuable tool for the biomedical domain. It can help reduce the time biocurators spend on curation. In addition, it can help scientists reach relevant information in a structured format by facilitating the expansion of existing databases. We plan to enlarge the target experimental method list and the manually annotated data set while addressing the edge cases discussed in the ‘Discussion’ section.

Acknowledgements

We would like to thank Hacer Karatas for the productive discussion and her help with the data annotation. We would also like to thank the BioC Task organizers for organizing the shared task and for their help with the data preparation and the questions.

Funding

Marie Curie FP7-Reintegration-Grants within the 7th European Community Framework Programme and the BAGEP Award of the Science Academy (to AO).

Conflict of interest. None declared.

References

1. Phizicky, E.M. and Fields, S. (1995) Protein-protein interactions: methods for detection and analysis. Microbiol. Rev., 59, 94–123.
2. Chatr-Aryamontri, A., Breitkreutz, B.J., Oughtred, R. et al. (2015) The BioGRID interaction database: 2015 update. Nucleic Acids Res., 43, D470–D478.
3. Hermjakob, H., Montecchi-Palazzi, L., Lewington, C. et al. (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res., 32, D452–D455.
4. Xenarios, I., Salwinski, L., Duan, X.J. et al. (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res., 30, 303–305.
5. Chatr-Aryamontri, A., Ceol, A., Palazzi, L.M. et al. (2007) MINT: the Molecular INTeraction database. Nucleic Acids Res., 35, D572–D574.
6. Bader, G.D., Betel, D. and Hogue, C.W. (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res., 31, 248–250.
7. Baumgartner, W.A., Cohen, K.B., Fox, L.M. et al. (2007) Manual curation is not sufficient for annotation of genomic databases. Bioinformatics, 23, i41–i48.
8. Arighi, C.N., Lu, Z., Krallinger, M. et al. (2011) Overview of the BioCreative III workshop. BMC Bioinformatics, 12, S1.
9. Hirschman, L., Yeh, A., Blaschke, C. and Valencia, A. (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6, S1.
10. Krallinger, M., Morgan, A., Smith, L. et al. (2008) Evaluation of text-mining systems for biology: overview of the second BioCreative community challenge. Genome Biol., 9, S1.
11. Kim, J.D., Ohta, T., Pyysalo, S. et al. (2009) Overview of BioNLP'09 shared task on event extraction. In: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task (BioNLP '09), Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1–9.
12. Kim, J.D., Pyysalo, S., Ohta, T. et al. (2011) Overview of BioNLP shared task 2011. In: Proceedings of the BioNLP Shared Task 2011 Workshop, Association for Computational Linguistics, Portland, OR, USA, pp. 1–6.
13. Nédellec, C., Bossy, R., Kim, J.D. et al. (2013) Overview of BioNLP shared task 2013. In: Proceedings of the BioNLP Shared Task 2013 Workshop, pp. 1–7.
14. Krallinger, M., Leitner, F., Rodriguez-Penagos, C. et al. (2008) Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol., 9, S4.
15. Tikk, D., Thomas, P., Palaga, P. et al. (2010) A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput. Biol., 6, e1000837.
16. Krallinger, M. (2010) Importance of negations and experimental qualifiers in biomedical literature. In: Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, Association for Computational Linguistics, pp. 46–49.
17. Rinaldi, F., Kappeler, T., Kaljurand, K. et al. (2008) OntoGene in BioCreative II. Genome Biol., 9, S13.
18. Ehrler, F., Gobeill, J., Tbahriti, I. and Ruch, P. (2007) GeneTeam site report for BioCreative II: customizing a simple toolkit for text mining in molecular biology. In: Proceedings of the Second BioCreative Challenge Evaluation Workshop, Madrid, Spain, pp. 199–207.
19. Krallinger, M., Vazquez, M., Leitner, F. et al. (2011) The protein-protein interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics, 12, S3.
20. Hermjakob, H., Montecchi-Palazzi, L., Bader, G. et al. (2004) The HUPO PSI's molecular interaction format: a community standard for the representation of protein interaction data. Nat. Biotechnol., 22, 177–183.
21. Kerrien, S., Orchard, S., Montecchi-Palazzi, L. et al. (2007) Broadening the horizon–level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol., 5, 44.
22. Kappeler, T., Clematide, S., Kaljurand, K. et al. (2008) Towards automatic detection of experimental methods from biomedical literature. In: Third International Symposium on Semantic Mining in Biomedicine (SMBM), Turku Centre for Computer Science (TUCS).
23. Lourenço, A., Conover, M., Wong, A. et al. (2011) A linear classifier based on entity recognition tools and a statistical approach to method extraction in the protein-protein interaction literature. BMC Bioinformatics, 12, 1.
24. Jhamb, D., Krishnan, A., Palakal, M. et al. (2014) Identification of protein interaction methods from biomedical literature. In: Computational Advances in Bio and Medical Sciences (ICCABS), 2014 IEEE 4th International Conference on, Miami Beach, FL, USA, pp. 1–6.
25. Danger, R., Pla, F., Molina, A. and Rosso, P. (2014) Towards a protein–protein interaction information extraction system: recognizing named entities. Knowledge-Based Syst., 57, 104–118.
26. Matos, S., Campos, D. and Oliveira, J.L. (2010) Vector-space models and terminologies in gene normalization and document classification. In: Proceedings of the BioCreative III Workshop, Citeseer, pp. 119–124.
27. Schneider, G., Clematide, S. and Rinaldi, F. (2011) Detection of interaction articles and experimental methods in biomedical literature. BMC Bioinformatics, 12, 1.
28. Agarwal, S., Liu, F. and Yu, H. (2011) Simple and efficient machine learning frameworks for identifying protein-protein interaction relevant articles and experimental methods used to study the interactions. BMC Bioinformatics, 12, S10.
29. Wang, X., Rak, R., Restificar, A. et al. (2011) Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature. BMC Bioinformatics, 12, S11.
30. Soulat, D., Bürckstümmer, T., Westermayer, S. et al. (2008) The DEAD-box helicase DDX3X is a critical component of the TANK-binding kinase 1-dependent innate immune response. EMBO J., 27, 2135–2146.
31. Aydın, F., Hüsünbeyi, Z.M. and Özgür, A. (2015) Retrieving passages describing experimental methods using ontology and term relevance based query matching. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, Sevilla, Spain, pp. 42–50.
32. Comeau, D.C., Doğan, R.I., Ciccarese, P. et al. (2013) BioC: a minimalist approach to interoperability for biomedical text processing. Database, 2013, bat064.
33. Kim, S., Doğan, R.I., Chatr-Aryamontri, A. et al. (2015) Overview of BioCreative V BioC track. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, Sevilla, Spain, pp. 1–9.
34. Kim, S., Doğan, R.I., Chatr-Aryamontri, A. et al. (2016) BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID. Database, 2016, baw121.
35. Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
36. Mikolov, T., Yih, W. and Zweig, G. (2013) Linguistic regularities in continuous space word representations. In: HLT-NAACL, Association for Computational Linguistics, pp. 746–751.
37. Maloney, C., Sequeira, E., Kelly, C. et al. (2013) PubMed Central. In: The NCBI Handbook [Internet], 2nd edn. National Center for Biotechnology Information (US), Bethesda, MD.
38. Hripcsak, G. and Rothschild, A.S. (2005) Agreement, the F-measure, and reliability in information retrieval. J. Am. Med. Inform. Assoc., 12, 296–298.
39. Manning, C.D., Surdeanu, M., Bauer, J. et al. (2014) The Stanford CoreNLP natural language processing toolkit. In: Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60. http://www.aclweb.org/anthology/P/P14/P14-5010.
40. Lan, M., Tan, C.L., Su, J. and Lu, Y. (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans. Pattern Anal. Mach. Intell., 31, 721–735.
41. Erk, K. (2012) Vector space models of word meaning and phrase meaning: a survey. Lang. Linguist. Compass, 6, 635–653.
42. Morin, F. and Bengio, Y. (2005) Hierarchical probabilistic neural network language model. In: Cowell, R.G. and Ghahramani, Z. (eds.) Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Society for Artificial Intelligence and Statistics, pp. 246–252.
43. Mnih, A. and Hinton, G.E. (2009) A scalable hierarchical distributed language model. In: Advances in Neural Information Processing Systems, Curran Associates, Inc., pp. 1081–1088.
44. Jaccard, P. (1908) Nouvelles recherches sur la distribution florale. Bull. Soc. Vaudoise Sci. Nat., 44, 223–270.
45. Cole, M., Nolte, C. and Werr, W. (2006) Nuclear import of the transcription factor SHOOT MERISTEMLESS depends on heterodimerization with BLH proteins expressed in discrete sub-domains of the shoot apical meristem of Arabidopsis thaliana. Nucleic Acids Res., 34, 1281–1292.
46. Tsuruoka, Y., Tateishi, Y., Kim, J.D. et al. (2005) Developing a robust part-of-speech tagger for biomedical text. In: Panhellenic Conference on Informatics, Springer, Volos, Greece, pp. 382–392.
47. Özgür, A., Xiang, Z., Radev, D.R. and He, Y. (2011) Mining of vaccine-associated IFN-γ gene interaction networks using the Vaccine Ontology. J. Biomed. Semant., 2, 1.
48. Hur, J., Özgür, A., Xiang, Z. and He, Y. (2015) Development and application of an interaction network ontology for literature mining of vaccine-associated gene-gene interactions. J. Biomed. Semant., 6, 1.

Author notes

Citation details: Aydın,F., Hüsünbeyi,Z.M., and Özgür,A. Automatic query generation using word embeddings for retrieving passages describing experimental methods. Database (2017) Vol. 2017: article ID baw166; doi:10.1093/database/baw166

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.