Predicting structured metadata from unstructured metadata

Author Notes

Abstract

Enormous amounts of biomedical data have been and are being produced by investigators all over the world. However, one crucial and limiting factor in data reuse is accurate, structured and complete description of the data or data about the data—defined as metadata. We propose a framework to predict structured metadata terms from unstructured metadata for improving quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using a Latent Dirichlet Allocation model (LDA) to reduce the dimensionality of the unstructured data. Our results on the GEO database show that structured metadata terms can be the most accurately predicted using the TF-IDF approach followed by LDA both outperforming the majority vote baseline. While some accuracy is lost by the dimensionality reduction of LDA, the difference is small for elements with few possible values, and there is a large improvement over the majority classifier baseline. Overall this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction.

Introduction

Enormous amounts of biomedical data have been and are being produced by investigators all over the world. This is mainly due to advancements in molecular technologies that have enabled extensive profiling of biological samples and have unleashed a myriad of omics data such as gene expression, microRNA expression, DNA methylation and DNA mutation data. However, during the last decade, journals, investigators, funding agencies, etc. realized that this data should be stored and shared with other investigators. Several databases were created in the process to house this data and make it available to the community at large, such as the NCBI databases for microarray data; Gene Expression Omnibus (GEO) (1) and sequence data; the database of Genotypes and Phenotypes (dbGAP) (2). This data repurposing—the use of data beyond what the original investigators envisioned—has been very fruitful for the biomedical and biological community by allowing meta-analysis studies to discover novel biology hidden in each single dataset.

In this article, we predict the structured metadata from textual metadata elements. With predicting metadata, we aim to improve the quality and quantity of metadata. This improvement for biomedical datasets is crucial to drive the next paradigm shift in data reuse. Tackling the metadata problem will remove hurdles in mapping datasets to each other and time consuming efforts to find datasets for an investigator's disease of interest. It will allow streamlined meta-analysis due to easier discovery of similar datasets owing to harmonized metadata annotations.

Despite the development of community standards to describe an experiment (e.g. Minimum Information About a Microarray Experiment; MIAME) (3) or a dataset (e.g. Health Care and Life Sciences; HCLS dataset description) (4), these are often poorly implemented, making it difficult to find, access and reuse data. However, one crucial and limiting factor in data reuse is accurate, structured and complete description of the data or data about the data—defined as metadata. Overall, current metadata is often of low quality, and many fields are either altogether absent, erroneous or inconsistent. For example, the largest database of gene expression studies, the GEO microarray database, contains over 50 000 studies, more than 1.3 million samples, and is still growing. However, only 15 metadata fields are consistently annotated out of 32 fields. Thousands of records have empty values. The majority of the samples have fewer than four sample level annotations and they suffer from a spectacular lack of consistency (e.g. there are over 30 different ways to specify the ‘age in years’ in the GEO database). It takes time and effort to create well-specified metadata, and investigators view the task of metadata authoring (or data annotation) to be a burden that may benefit other scientists, but not the team that did the work in the first place. There is minimal verification for the correctness of metadata during submission. Because of this, valuable information may not be reusable by other researchers.

Our solution is to leverage textual descriptions for generating structured data based on context and corpus statistics. The main contribution of this paper is to explore whether unstructured gene expression sample metadata contains information, which can be exploited for predicting structured metadata using traditional text mining methods based on term frequency-inverse document frequency (TF-IDF). Furthermore, we will explore whether topic models are able to identify semantically meaningful topics in the unstructured text contained in sample metadata, and whether this topic representation is suitable for training supervised classifiers, with the aim of predicting the values of structured metadata elements compared to the TF-IDF approach.

Topic models are a widely used method for the analysis of textual corpora and other discrete data. We use Latent Dirichlet Allocation (5), one of the most popular topic models, for discovering the latent topics which are present in the metadata of gene expression samples. Topic models provide the opportunity to reduce the description length of the documents as compared to TF-IDF, and they also produce semantically meaningful features in the form of latent topics (5). In the experiments presented in this article, these topics are used as features for training supervised classifiers, with the aim of predicting metadata values. We compare the classification results of topic models to those produced by classifiers that use TF-IDF features to investigate how much discriminatory information is lost by this reduction of document description length.

This article is structured as follows. The ‘Dataset’ section describes the characteristics of the dataset that we used for our experiments, and ‘Methods’ section describes the details of our experimental setup. The results of our experiments are reported in the ‘Results’ section. The ‘Discussion’ section 5 discusses the results and provides directions for future work. The ‘Conclusion’ section concludes this article.

Dataset

For all our experiments, we used data from GEO, a database of gene expression data which contains experimental metadata largely authored by original data submitters. The GEO database contains over 1.3 million records. It contains over 50 000 studies, called ‘series’ in GEO. A series (identified with GSE00001 in GEO) organizes samples into a data set defining an experiment. Each study contains 30 samples on average. A sample (identified with GSE00001 in GEO) describes the set of molecules that are being probed and references a single platform used to generate its molecular abundance data (GEO) (1). Note that in some cases samples can belong to multiple series which complicates the evaluation approach (see below). In this article, we used data at the sample level for all studies. Each study is annotated with up to 32 metadata fields representing the conditions under which the sample was handled, the manipulations it underwent and the measurements of each element derived from it. These metadata fields can be either structured or unstructured.

We define a structured element as a metadata element which contains a single concept, such as the organism from which the material was derived. Values of structured elements have the potential to be mapped to concepts contained in a controlled vocabulary. Analogously, we define an unstructured element as a metadata element which contains any other form of textual description provided by the user. The value domain of unstructured elements is free text and cannot be predicted in any meaningful way. Tables 1 and 2 show a list of the structured and unstructured metadata fields, respectively. Next, we define document as the combined unstructured text of one metadata record. Figure 1 shows which terms occur most frequently in these documents.

Figure 1.

Frequent terms in the unstructured text values of GEO metadata values. This figure shows terms which are frequently used in the unstructured elements of the GEO dataset. Colors and size correspond to the frequency of a term.

Open in new tab Download slide

Table 1.

Open in new tab

Structured metadata fields in GEO.

Field	Description
GPL	The unique GEO accession number for the platform (GSE00001). A platform is, essentially, a list of probes that define what set of molecules may be detected, e.g. GPL13653 for the Affymetrix GeneChip Rat Genome U34A Array.
Type	Type of the sample e.g., RNA or DNA.
Organism	The organism(s) from which the biological material was derived, e.g. homo sapiens, drosophila melanogaster.
Molecule	Type of molecule that was extracted from the biological material, e.g. total RNA, cytoplasmic RNA.
Label	The compound used to label the extract, e.g. biotin, Cy3.

Field	Description
GPL	The unique GEO accession number for the platform (GSE00001). A platform is, essentially, a list of probes that define what set of molecules may be detected, e.g. GPL13653 for the Affymetrix GeneChip Rat Genome U34A Array.
Type	Type of the sample e.g., RNA or DNA.
Organism	The organism(s) from which the biological material was derived, e.g. homo sapiens, drosophila melanogaster.
Molecule	Type of molecule that was extracted from the biological material, e.g. total RNA, cytoplasmic RNA.
Label	The compound used to label the extract, e.g. biotin, Cy3.

This table lists the structured metadata fields along with a description of each element.

Table 1.

Open in new tab

Structured metadata fields in GEO.

Field	Description
GPL	The unique GEO accession number for the platform (GSE00001). A platform is, essentially, a list of probes that define what set of molecules may be detected, e.g. GPL13653 for the Affymetrix GeneChip Rat Genome U34A Array.
Type	Type of the sample e.g., RNA or DNA.
Organism	The organism(s) from which the biological material was derived, e.g. homo sapiens, drosophila melanogaster.
Molecule	Type of molecule that was extracted from the biological material, e.g. total RNA, cytoplasmic RNA.
Label	The compound used to label the extract, e.g. biotin, Cy3.

Field	Description
GPL	The unique GEO accession number for the platform (GSE00001). A platform is, essentially, a list of probes that define what set of molecules may be detected, e.g. GPL13653 for the Affymetrix GeneChip Rat Genome U34A Array.
Type	Type of the sample e.g., RNA or DNA.
Organism	The organism(s) from which the biological material was derived, e.g. homo sapiens, drosophila melanogaster.
Molecule	Type of molecule that was extracted from the biological material, e.g. total RNA, cytoplasmic RNA.
Label	The compound used to label the extract, e.g. biotin, Cy3.

This table lists the structured metadata fields along with a description of each element.

Table 2.

Open in new tab

Unstructured metadata fields in GEO.

Field	Description
Title	A unique title that describes the samples.
Source Name	The biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.
Treatment Protocol	Treatments applied to the biological material prior to extract preparation.
Extraction Protocol	The protocol used to isolate the extract material.
Label Protocol	The protocol used to label the extract.
Hybridization Protocol	The protocols used for hybridization, blocking and washing, and any post-processing steps such as staining.
Description	Any additional information not provided in the other fields.
Data Processing	Details of how data in the matrix table were generated and calculate.

Field	Description
Title	A unique title that describes the samples.
Source Name	The biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.
Treatment Protocol	Treatments applied to the biological material prior to extract preparation.
Extraction Protocol	The protocol used to isolate the extract material.
Label Protocol	The protocol used to label the extract.
Hybridization Protocol	The protocols used for hybridization, blocking and washing, and any post-processing steps such as staining.
Description	Any additional information not provided in the other fields.
Data Processing	Details of how data in the matrix table were generated and calculate.

This table lists the unstructured metadata fields along with a description of each element.

Table 2.

Open in new tab

Unstructured metadata fields in GEO.

Field	Description
Title	A unique title that describes the samples.
Source Name	The biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.
Treatment Protocol	Treatments applied to the biological material prior to extract preparation.
Extraction Protocol	The protocol used to isolate the extract material.
Label Protocol	The protocol used to label the extract.
Hybridization Protocol	The protocols used for hybridization, blocking and washing, and any post-processing steps such as staining.
Description	Any additional information not provided in the other fields.
Data Processing	Details of how data in the matrix table were generated and calculate.

Field	Description
Title	A unique title that describes the samples.
Source Name	The biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.
Treatment Protocol	Treatments applied to the biological material prior to extract preparation.
Extraction Protocol	The protocol used to isolate the extract material.
Label Protocol	The protocol used to label the extract.
Hybridization Protocol	The protocols used for hybridization, blocking and washing, and any post-processing steps such as staining.
Description	Any additional information not provided in the other fields.
Data Processing	Details of how data in the matrix table were generated and calculate.

This table lists the unstructured metadata fields along with a description of each element.

Methods

In this section, we describe the experimental setup and the methods used for our classification experiments. Our goal is to predict structured metadata values using a classifier trained on the latent topic distributions derived from the unstructured elements and to analyse how well this method performs compared to using the full vocabulary for document description. To accomplish this, we use the documents’ latent topic distributions inferred by LDA, as well as TF-IDF values, as features to represent the documents. For our metadata prediction experiments, we constructed a pipeline (Figure 2) for training and evaluating different supervised classifiers. The experimental setup consists of four main components for training the classifiers and evaluating their predictive performance: (i) pre-processing and splitting the data into a training set and a test set, (ii) training the LDA model, calculating TF-IDF values and training the supervised classifiers, (iii) inferring the per-document topic distributions, calculating TF-IDF values for the test set based on the document frequencies of the training set and predicting the classes and (iv) evaluation of the predictions. Figure 2 shows an overview of this setup, based on LDA features; the setup which uses TF-IDF features is analogous. The experimental setup ensures that no information from the training set is used in evaluation. For evaluation, we compare the performance of the supervised classifiers trained on the topic distributions obtained by LDA with the classifiers trained on TF-IDF features and with the performance of a majority classifier. In the remainder of this section, the individual components of the setup are described in more detail.

Figure 2.

Overview of the experimental setup. This figure shows the four main components of the experimental setup: pre-processing and splitting the data into a training set and a test set, training the LDA model as well as the supervised classifier, inferring the per-document topic distributions and predicting the classes, and evaluation of the predictions. The classification setup for the classifiers using TF-IDF features is analogous: Instead of latent topics, TF-IDF values are used for representing the documents.

Open in new tab Download slide

Pre-processing

Gene Expression Omnibus (GEO) consists of 1 368 682 individual sample records. As a first step, we split the records randomly into a training set consisting of 80% of the records and a test set consisting of the remaining 20%. As previously explained samples belong to a series and a subset of samples can belong to multiple series. To avoid information leakage from the training to the test set, we first identified the superseries that each sample belongs to defined as the grouping of all series that share at least one sample. Then a superseries is either assigned to the train or test set and thus never split up. After the split, both the unstructured text and the values of the structured elements are pre-processed in preparation for the classification experiments.

Unstructuredtext

We applied standard pre-processing methods to the combined values (i.e. unstructured text) of all unstructured elements. After removing English stopwords (https://code.google.com/p/stop-words/) as well as rare and frequent words, the vocabulary consisted of 70 802 distinct terms. Specifically, words which had less than three characters and words which occurred in < 20 documents of the training set were removed, as well as the most frequently occurring 200 words of the training set. Furthermore, documents with less than five words were removed from the dataset. We did not apply stemming as unstemmed words result in more easily interpretable topics and allow the topic model to capture more subtle latent semantics.

Labels

The structured elements of the dataset have a heavily skewed value distribution. As an example, Figure 3 shows the value distribution of the element ‘Organism’ in the training set, after minimal pre-processing (lowercasing and removing special characters). Many values are rare; 1233 values occur <10 times.

Figure 3.

Value distribution of the element ‘Organism’. This figure shows the skewed distribution of values of the element ‘Organism’ (log scale), with the most frequently used value (homo sapiens) occurring 603 446 times and 1233 values occurring <10 times.

Open in new tab Download slide

As supervised classifiers rely on being provided with a sufficiently large set of training examples for each class, we chose to exclude infrequently used element values. Specifically, we excluded values which occurred in less than 0.1% of the dataset records (i.e. values which occurred <1368 times in the dataset). Table 3 shows the number of classes which remained after excluding rare classes, for each of the structured elements.

Table 3.

Open in new tab

Number of classes which constitute at least 0.1% of the dataset.

Element Name	Number of classes	Example Values
GPL	123	gpl570, gpl1261, gpl96, gpl10558
Label	23	biotin, cy3, cy5_and_cy3, alexa_fluor_647
Molecule	5	total_rna, genomic_dna, polya_rna, protein
Organism	33	homo_sapiens, zea_mays, gallus_gallus
Type	7	rna, genomic, sra, protein

Element Name	Number of classes	Example Values
GPL	123	gpl570, gpl1261, gpl96, gpl10558
Label	23	biotin, cy3, cy5_and_cy3, alexa_fluor_647
Molecule	5	total_rna, genomic_dna, polya_rna, protein
Organism	33	homo_sapiens, zea_mays, gallus_gallus
Type	7	rna, genomic, sra, protein

This table shows the number of classes which constitute more than 0.1% of the dataset, as well as example values, for each structured element.

Table 3.

Open in new tab

Number of classes which constitute at least 0.1% of the dataset.

Element Name	Number of classes	Example Values
GPL	123	gpl570, gpl1261, gpl96, gpl10558
Label	23	biotin, cy3, cy5_and_cy3, alexa_fluor_647
Molecule	5	total_rna, genomic_dna, polya_rna, protein
Organism	33	homo_sapiens, zea_mays, gallus_gallus
Type	7	rna, genomic, sra, protein

Element Name	Number of classes	Example Values
GPL	123	gpl570, gpl1261, gpl96, gpl10558
Label	23	biotin, cy3, cy5_and_cy3, alexa_fluor_647
Molecule	5	total_rna, genomic_dna, polya_rna, protein
Organism	33	homo_sapiens, zea_mays, gallus_gallus
Type	7	rna, genomic, sra, protein

This table shows the number of classes which constitute more than 0.1% of the dataset, as well as example values, for each structured element.

Model training

The training pipeline first trains a topic model on the unstructured text from the training set and subsequently trains a supervised classifier on the features resulting from the topic model. For comparison, the same type of classifier is trained on TF-IDF features.

Topic model

The first step of the training pipeline trains a topic model on the unstructured text of the training set. The topic model which we use for our experiments is Latent Dirichlet Allocation (LDA). This is a generative Bayesian model which was introduced by Blei et al. (5). LDA requires the number of latent topics to be set in advance, as well as the specification of two hyper-parameters: α for smoothing the document-topic distribution and β for smoothing the topic-term distribution. We built an LDA model with 150 topics on the unstructured text of the training set. This number of topics was chosen based on a document description length which is longer than the number of different values in the most diverse structured element (GPL). The hyper-parameters α and β were set to 0.1 and 0.01, respectively. These relatively low values indicate our prior beliefs that a sample consists of a mixture of relatively few topics, and that the topics in this dataset contain a mixture of a few, specific words (rather than many words of equal importance). A total of 200 iterations of Collapsed Variational Bayes inference (6) were used to train the model. In the model, each document is represented by a distribution over all 150 topics, and each topic is represented by a distribution over all terms in the vocabulary. This means that the dimensionality of the feature space, compared to representation methods such as term frequency or TF-IDF, was reduced by almost 98% (from 70 802 to 150).

TF-IDF

With the document frequencies of the training set, TF-IDF values are calculated for all documents in the training set.

Supervisedclassifiers

The document-topic distributions resulting from the LDA model, as well as the TF-IDF values, were used as features for representing the documents in the supervised classification part of our experiments. For classification, we used a support vector machine (SVM) with linear kernel for each of the structured elements (7). Additionally, we used a majority classifier as a baseline. The classifiers were trained on the records from the training set. The linear SVM was trained using the default value for the penalty parameter C = 1.0. All classifiers were trained and tested using the Python library scikit-learn (http://scikit-learn.org/stable/). The LDA model was built using the Stanford Topic Modeling Toolbox (http://nlp.stanford.edu/software/tmt/tmt-0.4/).

Prediction pipeline

The prediction pipeline first uses the trained topic model to infer the document-topic distributions of the records contained in the test set, and then uses these inferred distributions for predicting the classes with the supervised classifier. To analyse how much information was lost during the dimensionality reduction, TF-IDF features are also used for prediction.

Topicmodel

For inferring the topic distribution of new, unseen documents, the topic-term distribution learned by the topic model from the training data was used. It remains unchanged as the document-topic distribution for the new documents is inferred.

TF-IDF

With the document frequencies from the training set, TF-IDF values are calculated for all documents in the test set.

Supervisedclassifiers

The inferred document-topic distributions, as well as the TF-IDF values, were used as a representation of the documents contained in the test set, for the prediction performed by classifiers which were trained as described above. The trained supervised classifiers were then used to predict the classes of the records from the test set.

Evaluation

For evaluating the classifiers, we compare the performance of the different classification models by using the standard evaluation measures precision, recall and F1-Score, defined as the harmonic mean of precision and recall. The metrics we report are class averages, weighted by the support of the class (i.e. the number of occurrences of this class in the test set). Each prediction is compared to the actual class value of our ground truth (i.e. the original class of the respective sample). Additionally, we calculated the metrics for a baseline classifier in order to obtain a fair evaluation of the prediction results. For this baseline, we used a majority classifier, which predicts the majority (i.e. most frequent) class for each record. All code developed for this project is available at github (https://git.gesis.org/lisa.posch/geo-metadata-prediction).

Results

This section describes the latent topics identified by LDA, as well as the results of the supervised classification experiments.

GEO metadata topics

In LDA, a latent topic is defined by a probability distribution over the vocabulary. Words in a topic can therefore be ranked by the probability which they have in this topic, in order to visualize a topic by showing its most representative terms. Figure 4 depicts six topics from the 150 topics, showing the top 10 words for each topic (refer Supplementary data for a description of all topics). The size of the words corresponds to the importance of the word in the topic.

Figure 4.

Top words for six sample topics. This figure shows the top 10 words for six of the 150 topics that were learned by LDA. Color and size of the words correspond to the relative importance of a word in the topic.

Open in new tab Download slide

Metadata prediction

The performance of the SVM with linear kernel classifier, for both feature sets (latent topics and TF-IDF), as well as the performance of the majority classifier (MC) baseline are depicted in Figure 5. The best classification results were achieved for the elements with the smallest number of classes. The element with the highest number of classes, GPL, had the lowest absolute scores, but a large improvement over the baseline. During the dimensionality reduction of 98% by LDA, information is lost compared to using the full vocabulary as features for training the classifiers. However, the performance on elements with few values is similar as that of the classifiers trained on the full vocabulary. The baseline classifier is outperformed in all cases, both by the classifiers using TF-IDF features and the classifiers using topic features.

Figure 5.

Evaluation results. This figure shows weighted class averages for precision, recall and F1-Score for each structured element. Results are reported for linear SVM with LDA features, linear SVM with TF-IDF features and for the majority classifier (MC) baseline. While some information was lost during LDA’s dimensionality reduction (by 98%), both approaches performed better than the baseline for all elements.

Open in new tab Download slide

Discussion

The results of our experiments indicate that unstructured metadata contains useful information for predicting structured metadata. This approach, which is not customized to the domain or specific dataset, is significant because elaborate descriptions are widely used in biomedical resources, while structured metadata is often incomplete, missing, inconsistent and even incoherent. Our results show that both TF-IDF and LDA models generated from textual descriptions in combination with supervised classification are an effective strategy to predict values for some structured metadata elements. TF-IDF on the full vocabulary resulted in the best performance. However this representation has the disadvantage of resulting in a very large dataset (over 70 000 features per document) that is not well semantically interpretable. Overall, our results offer a promising approach for metadata prediction that is likely to be applicable to other datasets, but more work is needed to assess its utility for different datasets.

Our classification setup achieved a better predictive performance than the baseline for all elements, despite the baseline, which predicts the majority class, being relatively high due to the heavily skewed class distribution of most elements. The goal of this work was not only to find the best supervised classifier for this task, but also to explore whether structured metadata elements can be predicted using the information from unstructured elements and whether topic distributions can successfully be used for metadata prediction in this domain.

One limitation of our experiments is that we did not use all values of the structured elements, but decided to exclude values which were used very infrequently. Future work is needed to investigate whether infrequently used values (<0.1% of the dataset) can be accurately predicted as well. In the experiments presented in this article, a fixed number of topics as well as fixed hyper-parameters were used, as LDA requires these parameters to be specified. An optimal number of topics and optimal hyper-parameters will likely lead to better prediction results. Therefore, future work is needed to further optimize these parameters. One planned approach is to use a non-parametric topic model using practical collapsed stochastic variational inference (8), as a memory friendly inference strategy that optimizes the number of topics as well as the hyper-parameters. Furthermore, LDA has been adapted and extended in many ways (9, 10). For example, a supervised form of LDA could be used which additionally takes into account the global popularity of the structured element values, in the form of a prior for the model.

Our work is highly relevant to both prospective and retrospective augmentation of metadata. In particular, the approach we describe here suggests that metadata authors and biocurators could be aided with accurate suggestions thereby reducing the time and effort to create rich, high quality metadata and help them scale to the increasing amount of submitted biomedical data. Furthermore, the predictions may help with correcting existing, wrongly entered element values: detected discrepancies between the element value entered by a user and the predicted value could be given to human curators for review. In line with the goals of the Center for Data Annotation and Retrieval (CEDAR) (11), we aim to incorporate this framework to improve the authoring of metadata. CEDAR is partnered with a number of community groups to further develop and evaluate our technologies including ImmPort, the data warehouse of immunology-related datasets led by Stanford (12), The Human Immunology Project Consortium (13), which channels experimental datasets to the ImmPort repository, The BioSharing Initiative (14) and The Stanford Digital Repository (15), operated by Stanford University Libraries, which helps all Stanford investigators to archive and disseminate data of all kinds. We plan to extend our context-sensitive recommendation engine (16) with the results of this work in order to provide improved real time support to metadata authors, particularly those that are following curation guidelines (17). Using a constellation of different technologies, such as what is proposed here, we will increase the accuracy of metadata values during submission, decrease the missing value problem and increase the consistency of metadata.

Conclusions

We have developed a framework for the prediction of structured metadata, using unstructured text. We use TF-IDF and a topic model approach on the unstructured text contained in metadata records. These TF-DF values and the latent topics are then used as features for training supervised classifiers. The results of our experiments indicate that unstructured metadata elements contain information which can be successfully exploited for predicting structured metadata elements. Furthermore, our results suggest that topic models can infer meaningful topics from the unstructured text of metadata records, and that these topics can be successfully used as features for training supervised classifiers. While some discriminatory information is lost compared to using the full vocabulary as features for training the classifiers, the performance on elements with few values is close to that of the classifiers trained on the full vocabulary. Our work has implications for researchers and practitioners interested in biomedical metadata curation and metadata prediction.

Funding

CEDAR, The Center for Expanded Data Annotation and Retrieval (U54 AI117925) awarded by the National Institute of Allergy and Infectious Diseases through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflict of interest. None declared.

References

Barrett

Wilhite

S.E.

Ledoux

et al. . (

2013

)

NCBI GEO: archive for functional genomics data sets - update

Nucleic Acids Res

991

–

995

Google Scholar

Crossref

WorldCat

Tryka

K.A.

Hao

Sturcke

et al. . (

2014

)

NCBI’s database of genotypes and phenotypes: DbGaP

Nucleic Acids Res

975

–

979

Google Scholar

Crossref

WorldCat

Brazma

(

2009

)

Minimum Information About a Microarray Experiment (MIAME)–successes, failures, challenges

ScientificWorldJournal

420

–

423

Gray

A.J.G.

Baran

Marshall

M.S.

Dumontier

(

2015

) Dataset Descriptions: HCLS Community Profile. http://www.w3.org/TR/hcls-dataset/ (6 November 2015, date last accessed].

Blei

D.M.

A.Y

Jordan

M.I.

(

2003

)

Latent dirichlet allocation

J. Mach. Learn. Res

993

–

1022

Google Scholar

OpenURL Placeholder Text

WorldCat

Teh

Y.W.

Newman

Welling

(

2007

)

A collapsed variational bayesian inference algorithm for latent dirichlet allocation,”

Adv. Neural Inf. Process. Syst

1353

Google Scholar

OpenURL Placeholder Text

WorldCat

Zhang

(

2001

)

An introduction to support vector machines and other kernel-based learning methods

AI Mag

103

Google Scholar

OpenURL Placeholder Text

WorldCat

Bleier

(

2013

) Practical collapsed stochastic variational inference for the HDP. In: NIPS Workshop on Topic Models: Computation, Application, and Evaluation.

Posch

Bleier

Schaer

Strohmaier

The Polylingual Labeled Topic Model. In: Proceedings of the KI 2015: Advances in Artificial Intelligence—38th Annual German Conference on AI, Dresden, Germany,

2015

, pp. 295–301.

Chemudugunta

Holloway

Smyth

Steyvers

(

2008

)

Modeling documents by combining semantic concepts with unsupervised statistical learning

Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics)

5318

229

–

244

Google Scholar

OpenURL Placeholder Text

WorldCat

Musen

M.A.

Bean

C.A.

Cheung

K.H.

et al. . (

2015

)

The Center for Expanded Data Annotation and Retrieval.

J. Am. Med. Inform. Assoc

1148

–

1152

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Bhattacharya

Andorf

Gomes

et al. . (

2014

)

ImmPort: disseminating data to the public for the future of immunology

Immunol. Res

234

–

239

Brusic

Gottardo

Kleinstein

Davis

(

2014

)

HIPC steering committee. Computational resources for high-dimensional immune analysis from the Human Immunology Project Consortium

Nat Biotechnol

146

–

148

Field

Sansone

Delong

E.F.

et al. . (

2010

)

Meeting report: BioSharing at ISMB 2010

Stand. Genomic Sci

254

–

258

Stanford Digital Repository. https://library.stanford.edu/research/stanford-digital-repository (6 November

2015

, date last accessed].

Panahiazar

Dumontier

Gevaert

Context Aware Recommendation Engine for Metadata Submission. In:

First International Workshop on Capturing Scientific Knowledge

Palisades

2015

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Dai

Tian

et al. . (

2013

)

AuthorReward: Increasing community curation in biological knowledge wikis through automated authorship quantification

Bioinformatics

1837

–

1839

Author notes

Citation details: Posch,L., Panahiazar,M., Dumontier,M. et al. Predicting structured metadata from unstructured metadata. Database (2016) Vol. 2016: article ID baw080; doi:10.1093/database/baw080

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Month:	Total Views:
December 2016	2
January 2017	5
February 2017	13
March 2017	16
April 2017	22
May 2017	17
June 2017	15
July 2017	10
August 2017	14
September 2017	2
October 2017	14
November 2017	8
December 2017	14
January 2018	25
February 2018	37
March 2018	31
April 2018	29
May 2018	94
June 2018	21
July 2018	38
August 2018	26
September 2018	32
October 2018	21
November 2018	24
December 2018	15
January 2019	14
February 2019	27
March 2019	29
April 2019	19
May 2019	28
June 2019	18
July 2019	26
August 2019	40
September 2019	101
October 2019	65
November 2019	25
December 2019	21
January 2020	13
February 2020	16
March 2020	17
April 2020	11
May 2020	15
June 2020	103
July 2020	59
August 2020	34
September 2020	18
October 2020	15
November 2020	13
December 2020	33
January 2021	12
February 2021	20
March 2021	31
April 2021	20
May 2021	25
June 2021	23
July 2021	14
August 2021	31
September 2021	17
October 2021	30
November 2021	23
December 2021	7
January 2022	14
February 2022	6
March 2022	32
April 2022	18
May 2022	9
June 2022	11
July 2022	8
August 2022	23
September 2022	26
October 2022	23
November 2022	26
December 2022	41
January 2023	12
February 2023	12
March 2023	11
April 2023	16
May 2023	19
June 2023	24
July 2023	6
August 2023	18
September 2023	21
October 2023	20
November 2023	17
December 2023	23
January 2024	43
February 2024	59
March 2024	28
April 2024	23
May 2024	25
June 2024	22
July 2024	19
August 2024	26
September 2024	19
October 2024	23
November 2024	27
December 2024	7
January 2025	14
February 2025	5
March 2025	16
April 2025	40
May 2025	14
June 2025	44
July 2025	5

Article Contents

Predicting structured metadata from unstructured metadata

Abstract

Introduction

Dataset

Methods

Pre-processing

Unstructuredtext

Labels

Model training

Topic model

TF-IDF

Supervisedclassifiers

Prediction pipeline

Topicmodel

TF-IDF

Supervisedclassifiers

Evaluation

Results

GEO metadata topics

Metadata prediction

Discussion

Conclusions

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Predicting structured metadata from unstructured metadata Open Access

Abstract

Introduction

Dataset

Methods

Pre-processing

Unstructuredtext

Labels

Model training

Topic model

TF-IDF

Supervisedclassifiers

Prediction pipeline

Topicmodel

TF-IDF

Supervisedclassifiers

Evaluation

Results

GEO metadata topics

Metadata prediction

Discussion

Conclusions

Funding

References

Author notes

Supplementary data

Citations

Views

Altmetric

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

Predicting structured metadata from unstructured metadata