The Microbe Directory: a centralized database for biological interpretation of microbiome data

Abstract

The Microbe Directory (TMD) is a centralized database of metadata for microbes from all domains that helps with the biological interpretation of metagenomic data. The database comprises phenotypical and ecological traits of microorganisms, which have been verified by independent manual annotations. This effort has been possible by the help of a community of volunteer students worldwide who were trained in manual curation of microbiology data. To summarize this information, we have built an interactive browser that makes the database accessible to everyone, including non-bioinformaticians. We used the TMD data to analyse microbiome samples from different projects such as MetaSUB, TARA Oceans, Human Microbiome Project, and Sponge Microbiome Project, showcasing the utility of TMD. Furthermore, we compare our microbial annotations with annotations collected by artificial intelligence (AI) and demonstrate that despite the high speed of AI in reviewing and collecting microbial data, annotation requires domain knowledge and therefore manual curation. Collectively, TMD provides a unique source of information that can help to interpret microbiome data and uncover biological associations.

Database URL: www.themicrobedirectory.com/

Introduction

The development and use of databases in science is not a recent phenomenon. In recent decades, databases have become a cornerstone of biological research, largely due to advances in molecular techniques [1], which have enabled access to the hundreds to millions of letters that make up a genome.

The first sequenced genome—a bacteriophage—completed by Fredrick Sanger in 1977 [2], was published in its entirety within the manuscript, as no public storage repositories existed at the time to house the 5386-base pair (bp) genome. As a comparison, if the human genome had instead been printed, it would have required more than 1.5 million paper pages to contain the 3.1 billion bp sequence. The sequencing of large genomes, especially the Human Genome Project (HGP), highlighted the need of a centralized method to store sequence information that could be easily accessed, and shareable among research teams worldwide [3, 4].

Today, the scale of sequencing has expanded dramatically, particularly in the field of metagenomics. To date, more than 32,000 research articles describing metagenomes have been published in PubMed. Most of these studies represent newly sequenced metagenomes—a compilation of genomes that likely represent hundreds or thousands of species. Manual investigation of these complex datasets is time-consuming and may lead to a never-ending data curation process. This represents an immediate challenge for researchers attempting to interpret biological data and generate biological hypotheses from sequencing results.

The Microbe Directory (TMD) database was developed as a necessity for a free, user-friendly, and centralized database of microbial features and ecological characteristics. It is designed to be accessible to students, scientists, and clinicians. TMD is not a sequence repository, nor a taxonomic classification tool. Instead, TMD compiles metadata of microbes from all domains of life to aid with the interpretation of metagenomic outputs. By manual annotation and curation, TMD has compiled information to date on almost 18,000 species of bacteria (5201), archaea (873), algae (4209), fungi (3926), and viruses (3,596). The database provides information on 42 microbial features such as the type of genetic material in viruses and the type of Gram-staining for bacteria, as well as ecological characteristics such as the source of the microbiome (e.g. water or soil) or the ability to withstand extreme conditions (i.e. extremophiles). This represents a significant upgrade from the first two versions of the database [5, 6], with a more comprehensive dataset, extended annotations, and a user-friendly data interface that empowers users to navigate and utilize the database effectively.

Lastly, we underscore the importance of manual annotations in TMD, comparing them with annotations generated by artificial intelligence (AI). Despite the recent strides made in AI, our findings demonstrate the continued necessity for a domain of knowledge and, consequently, manual curation in creating and maintaining databases like TMD.

Methods

Volunteers training and data curation

To build the database, manual annotations were performed during the years 2019–22 with the help of almost 400 volunteers. Most of these volunteers were undergraduate students, with some high school and graduate students also participating. The students were recruited primarily through postings on social media and word-of-mouth among students. Volunteers joined the project from various cities in the USA, as well as other countries, including India, Colombia, Ecuador, and Brazil. The selection criteria included only a basic knowledge in microbiology, as the project aimed to provide students with tools and opportunities for remote learning.

All volunteers were trained to read articles, search on repositories, and identify information relevant for the annotation process by Zoom training sessions, video tutorials, and documented guide. This method ensured that all volunteers had an understanding of the research articles, the questions queried in the database, the microbial annotation, and submission process. Each week, the volunteers received a list of 10 microbial species by email and spent on average 8–13 h per week researching the species assigned, and submitting the information queried. Volunteers researched information of microbes from different articles and databases such as PubMed, the American Type Culture Collection (ATCC), MycoBank [5], Global Biodiversity Information Facility (GBIF) [6], eHOMD [7], GOLD [8], BacDive [9], and Diatoms.org [10], along with other openly accessible sources listed in the provided microbiology guides.

To compile information, volunteers submitted their microbial annotations in a customized survey in KoboToolBox [11], Supplementary Fig. 1. Information compiled by each volunteer was later summarized into a single file and cleaned from typos or redundancies using a custom script (see GitHub in the Data availability section).

Data compilation and processing

A list of all microbial species names, including bacteria, archaea, virus, fungi, and algae, were downloaded from NCBI taxonomy in the following link: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ version 2020-07-01 in a taxdump format. The format contains the full taxonomic structure used by NCBI for classifying organisms. It is widely used in bioinformatics for mapping organism names to their taxonomy, navigating the taxonomic hierarchy, map organism names to NCBI taxon IDs, and normalizing organism names across databases [12].

The species list was screened for redundant names using a list of exclusion words for each microbial group using a custom script available in our GitHub (https://github.com/mariaasierra/themicrobedirectory). The complete list of exclusion terms for each microbial group can be found in Supplementary Table 1.

Eukaryotic microorganisms do not differ taxonomically to multicellular organisms. For example, the microalgae Chlamydomonas sp. and macroalgae, Ulva sp. also known as sea lettuce, are within the same phylum clade Chlorophyta [13]. Therefore, species lists downloaded using the taxdump NCBI format included species of macroorganisms. To compile a list of only microbial eukaryotes, macrofungi and macroalgae were removed from the species list based on a literature review.

To compile the list of microfungi, known genera to comprise only macrofungi (with fruiting body) were removed from the species list retrieved from NCBI taxonomy. All genera of fungi were kept, except Chlorociboria, Paxina, Helvella, Morchella, Kalaharituber, Peziza, Pseudohelotium, Terfezia, Anthracobia, Isaria, Scutellinia, Rhizina, Phillipsia, Tuber, Choiromyces, Cordyceps, Daldinia, Poronia, Xylaria, Penzigia, Agaricus, Arachnion, Battarrea, Battarreoides, Bovista, Calvatia, Chlamydopus, Chlorophyllum, Coniolepiota, Coprinellus, Coprinopsis, Coprinus, Crucibulum, Disciseda, Gyrophragmium, Langermannia, Lepiota, Leucoagaricus, Lycoperdon, Macrolepiota, Montagnea, Montagnites, Mycenastrum, Parasola, Polyplocium, Psalliota, Secotium, Tulostoma, Xanthagaricus, Amanita, Amanita, Saproamanita, Bolbitius, Conocybe, Galeropsis, Pluteolus, Broomeia, Clavaria, Clavaria, Clavulinopsis, Mucronella, Cortinarius, Locellina, Chondrostereum, Cyphella, Claudopus, Clitopilus, Entoloma, Fistulina, Laccaria, Hygrocybe, Hygrophorus, Anellaria, Panaeolina, Panaeolus, Astrosporina, Crepidotus, Inocybe, Phaeoglabrotricha, Phaeosolenia, Lyophyllum, Podabrella, Termitomyces, Calyptella,Marasmius, Solenia, Cruentomycena, Favolaschia, Mycena, Flagelloscypha, Lachnella, Cyathus, Anthracophyllum, Gymnopus, Marasmiellus, Omphalotus, Phellorinia, Armillaria, Armillariella, Cyptotrama, Oudemansiella, Hymenopellis, Physalacria, Xerula, Pleurotus, Pluteus, Volvariella, Podaxis, Ozonium, Psathyrella, Pterula, Schizophyllum, Sebacina, Agrocybe, Deconica, Flammula, Galera, Gymnopilus, Hebeloma, Hymenogaster, Hypholoma, Kuehneromyces, Leratiomyces, Naucoria, Pholiota, Psilocybe, Stropharia, Tubaria, Amparoina, Cellypha, Clitocybe, Collybia, Lepista, Macrocybe, Melanoleuca, Omphalia, Tricholoma, Tricholomopsis, Tricholosporum, Trogia, Auricularia, Eichleriella, Exidia, Heterochaete, Aporpium, Aureoboletus, Boletus, Buchwaldoboletus, Chalciporus, Imleria, Leccinum, Octaviania, Xerocomellus, Phlebopus, Coniophora, Gyrodontium, Gyroporus, Melanogaster, Paxillus, Rhizopogon, Pisolithus, Scleroderma, Serpula, Suillus, Cantharellus, Pellicularia, Clavulina, Corticium, Cytidia, Dendrothele, Laetiporus, Tretopileus, Geasteropsis, Geastrum, Geastrum, Anthurus, Aseroe, Blumenavia, Clathrella, Clathrus, Ileodictyon, Itajahya, Jaczewskia, Kalchbrennera, Kalchbrennera, Mutinus, Phallus, Daedalea, Fomitopsis, Gloeocystidium, Phaeolus, Rhodofomitopsis, Amauroderma, Ganoderma, Gloeophyllum, Ramaria, Clavariadelphus, Coltricia, Fomitoparia, Fuscoporia, Hydnum, Hymenochaete, Phellinus, Polystictus, Trichaptum, Cotylidia, Grandinia, Heterochaete, Oxyporus, Crustodontia, Acia, Aegerita, Bjerkandera, Cymatoderma, Gloeoporus, Irpex, Laschia, Merulius, Mycoleptodon, Odontia, Phlebia, Podoscypha, Pseudolagarobasidium, Abortiporus, Coriolopsis, Coriolus, Daedaleopsis, Favolus, Fomes, Funalia, Grammothele, Heliocybe, Hexagonia, Lentinus, Lenzites, Lenzites, Lenzites, Lignosus, Microporus, Nigroporus, Neolentinus, Panus, Perenniporia, Picipes, Phellinus, Polyporus, Pycnoporus, Trametes, Lentinellus, Dentipellicula, Laxitextum, Asterostroma, Dichostereum, Lachnocladium, Peniophora, Lactarius, Lactifluus, Russula, Aleurodiscus, Stereum, Hypochnus, Thelephora, Arrhytidia, Calocera, Dacrymyces, Dacryopinax, Femsjonia, Naematoloma, Tremella, Phaeotremella, Sirobasidium, Pilobolus, Echinostelium, Cribraria, Dictydiaethalium, Licea, Lycogala, Reticularia, Tubifera, Diachea, Diderma, Didymium, Mucilago, Badhamia, Badhamiopsis, Craterium, Fuligo, Leocarpus, Physarella, Physarum, Willkommlangea, Amaurochaete, Comatricha, Enerthenema, Lamproderma, Stemonaria, Stemonitis, Stemonitopsis, Calomyxa, Arcyria, Hemitrichia, Metatrichia, Oligonema, Perichaena, Trichia, Ceratiomyxa, and C eratium.

To compile only microalgae, all taxa from the divisions Chlorophyta, Streptophyta, Rhodophyta, and Glaucophyta were kept, except Leiosporocerotopsida, Sphagnopsida, Ginkgoopsida, Andreaeobryopsida, Takakiopsida, Gnetopsida, Andreaeopsida, Haplomitriopsida, Pinopsida, Bryopsida, Jungermanniopsida, Magnoliopsida, Charophyceae, Oedipodiopsida, Marchantiopsida, Lycopodiopsida, Polytrichopsida, Polypodiopsida, Anthocerotopsida, Tetraphidopsida, and Cycadopsida.

Additionally, the type of eukaryote macro- or micro- was added to the annotation process done by the volunteers.

To add an extra layer of corroboration to our curation process, we assigned microbial species to, at least, three different randomized volunteers. Responses were compared and a majority-voting system was implemented. Answers where there was not a consensus response were labelled as ‘Incongruent’ (e.g. Gram-staining: positive and negative). In questions where multiple choices were allowed (e.g. microbiome source: soil and urban environments), all answers were kept and unified. See Supplemental Fig. 2 for an example.

Cleaning of the data compiled by volunteers was performed using custom scripts to remove typos in scientific names (see GitHub). Additionally, data from previous versions of TMD database [14, 15] were standardized with the newest microbial features and merged into the current dataset. Cleaned data files used to build the database can be found in GitHub (see the Data availability section).

Data generation with Gemini artificial intelligence

To compare the reliability and expertise of the manual annotation effort versus the speed of AI to annotate microbial characteristics, Gemini was tested using a series of prompts in Google Gemini AI studio v1.5 Flash (https://gemini.google.com/app) using the following link: https://ai.google.dev/aistudio. The prompts were submitted for archaea, bacteria, virus, microfungi, and microalgae. The features that were asked for bacteria and archaea included oxygen use, gram-stain or gram-staining, biofilm formation, spore formation, extremophile, type of extremophile, microbiome, metabolism, and pathogenicity. For virus, genetic material, strand, sense, capsid symmetry, envelope, pathogenicity, and host. For diatoms, type of growth, morphology, biofilm formation, UV resistance, extremophile, type of extremophile, lifestyle, symmetry, toxicity, antagonistic activity, and symbiosis.

The prompts submitted included the following statements: ‘Access PubMed and filter research papers in the field of microbiology that describe [domain] and mention any of the following characteristics: [feature 1, feature 2, feature 3, feature 4, etc.]’. Additionally, some facts were added to the prompts to guide Gemini while compiling the microbial data. The facts included ‘Archaea are not bacteria’ and ‘Gram-staining can be positive or negative’. To view the full list, please refer to Supplementary information. The prompt for each microbial group was submitted at least three times to get as many species entries as possible. No changes between resubmission of the prompts were made. After the third trial ~70% of the species names were duplicated, and the search effort was stopped.

Additionally, the free version of ChatGPT (version GPT-4o) by OpenAI [16] was tested to perform annotations with the prompts previously described for Gemini AI. However, ChatGPT was unable to directly access PubMed or other external databases for research papers. Since the purpose of the comparisons with TMD is to provide free annotations, we did not try the ChatGPT PLUS version.

External datasets

To demonstrate the utility of TMD database, four publicly available whole genome sequencing (WGS) datasets were downloaded: The Human Microbiome Project (HMP) [17], MetaSUB [18], TARA Oceans [19], and Sponge Microbiome Project (SMP) [20, 21]. For HMP, 19 randomized WGS samples were selected, including male and females. These samples comprised different body sites such as gingiva, gastrointestinal tract, right retroauricular crease, throat, and vagina (Supplementary Table 2). From the MetaSUB project, 24 WGS samples were downloaded. These included nine cities in nine countries with at least two different samples per city. Samples were selected semi-randomly to comprise beach and no-beach cities (i.e. beach city = Barcelona, Spain vs. no beach city = Berlin, Germany). From the TARA Oceans dataset, 10 samples from the Polar circle region were randomly downloaded. Lastly, from the SMP, two randomized samples from three sponge species from the genus Ircinia were selected (n = 6 total). Accession numbers for all samples can be found in Supplementary Table 2.

Raw FASTQ files from all datasets were downloaded and processed with the same computational pipeline, including taxonomic classification with Kraken2 with a custom microbial genome database that include all microbial genomes in RefSeq. Statistical analysis and comparisons with the TMD database were performed with custom R scripts, which can be found in GitHub: https://github.com/mariaasierra/themicrobedirectory (see the Data availability section).

Table 1.

Open in new tab

Comparison of the Microbe Directory database and other publicly available microbial databases. Some of the characteristics that are included in TMD are individually found in some databases. [8, 9, 22–40]

			Reference publication	Database publication year	the microbe directory
Database	Description	Link database			Bacteria Archaea	Fungi	Algae	Virus	Gram- Stain (Bacteria)	Microbiome	Metabolism	Pathogen	Host	Human body site	Spore forming	Oxygen use	Biofilm forming	Extrem-ophile	Anti-microbial Resistance	Genetic material (Virus)	Strand (Virus)	Envelop (Virus)	Capsid (Virus)
BACMAP Genome Atlas	BacMap is an interactive visual database containing hundreds of fully labeled, zoomable, and searchable maps of bacterial genomes	http://bacmap.wishartlab.com	[22]	2005	×				×	×	×	×	×
ATCC	American Type Culture Collection collects, stores, and distributes standard reference microorganisms	https://www.atcc.org/About	[23]	1927	×				×		×	×						×	×	×
VFDB	Virulence factors of pathogenic bacteria	http://www.mgc.ac.cn/cgi-bin/VFs/v5/main.cgi	[24]	2004	×				×		×	×	×			×
BV-BRC	Bacterial and Viral Bioinformatics Resource Center for bacterial and viral infectious diseases	https://www.bv-brc.org	[25]	2023	×			×	×			×	×						×		×
PHIDIAS	The Pathogen-Host Interaction Data Integration and Analysis System (PHIDIAS) for pathogens with high priority in public health and biological defense.	https://phidias.us/introduction.php	[26]	2007	×				×			×			×	×				×	×	×
eHOMD	Human Oral Microbiome Database provides a comprehensive curated information on the bacterial species present in the human aerodigestive tract	http://www.homd.org	[27]	2010	×				×	×		×		×	×
BacDive	The Bacterial Diversity Metadatabase	https://bacdive.dsmz.de/advsearch	[9]	2013	×				×	×	×	×	×	×	×	×		×	×
GOLD	Genomes Online Database	https://gold.jgi.doe.gov/organisms	[8]	1997	×	×	×	×	×		×				×	×
GBIF	Global Biodiversity Information Facility. Free and open access to biodiversity data	https://www.gbif.org/species/search	[28]	2001	×	×	×	×		×
ThermoBase	A Centralized Database of Thermophiles and Hyperthermophiles	https://exprotdb.com/organisms_browse.php	[29]	2022	×	×				×	×					×		×
HaloDom	Halophilic organisms across all domains	http://halodom.bio.auth.gr/?view=all_data	[30]	2017	×	×	×											×
BaAMPs	Database dedicated to antimicrobial peptides (AMPs) specifically tested against microbial biofilms.	http://www.baamps.it	[31]	2015	×												×
U.S National Fungus Collections	USDA Agricultural Research Service	https://fungi.ars.usda.gov	[32]	1890		×							×
Microfungi Collection Consortium	The Microfungi Collections Consortium (MiCC) is a collaborative effort to digitize specimen label data from North American microfungi	https://www.microfungi.org	NA	2014		×				×
Mycology Collection Portal	MyCoPortal is data derived from a network of universities, botanical gardens, museums, and agencies that provide taxonomic, environmental, and specimen-based information	https://www.mycoportal.org/	[33]	2012		×				×
EmsemblFungi	Ensembl Fungi is a browser for fungal genomes.	https://fungi.ensembl.org	[34]	2002		×						×	×		×
UNITE	Database and sequence management environment centered on the eukaryotic nuclear ribosomal ITS region	https://unite.ut.ee/index.php	[35]	2003		×	×						×
FungalTraits	Traits database of fungi and fungus-like stramenopiles	https://docs.google.com/spreadsheets/d/1cxImJWMYVTr6uIQXcTLwK1YNNzQvKJJifzzNpKCM6O0/edit?usp=sharing	[36]	2019		×			×			×
ViruSITE	Integrated database for viral genomics	http://www.virusite.org/index.php?nav=browse	[37]	2016				×					×							×	×
IRAM	Virus Capsids Database & Analysis Resources	NA	[38]	2019				×															×
VIPERdb	Virus Particle Explorer: a web interface for exploring virus structures	https://viperdb.org	[39]	2001				×					×							×	×		×
Virus-Host DB	Organizes data about the relationships between viruses and their hosts	https://www.genome.jp/virushostdb/	[40]	2016				×				×	×							×	×
Diatoms.org	Documents diversity of diatoms in North America, providing information on diatom identification, ecology and distribution	https://diatoms.org/	[10]	2022			×			×

			Reference publication	Database publication year	the microbe directory
Database	Description	Link database			Bacteria Archaea	Fungi	Algae	Virus	Gram- Stain (Bacteria)	Microbiome	Metabolism	Pathogen	Host	Human body site	Spore forming	Oxygen use	Biofilm forming	Extrem-ophile	Anti-microbial Resistance	Genetic material (Virus)	Strand (Virus)	Envelop (Virus)	Capsid (Virus)
BACMAP Genome Atlas	BacMap is an interactive visual database containing hundreds of fully labeled, zoomable, and searchable maps of bacterial genomes	http://bacmap.wishartlab.com	[22]	2005	×				×	×	×	×	×
ATCC	American Type Culture Collection collects, stores, and distributes standard reference microorganisms	https://www.atcc.org/About	[23]	1927	×				×		×	×						×	×	×
VFDB	Virulence factors of pathogenic bacteria	http://www.mgc.ac.cn/cgi-bin/VFs/v5/main.cgi	[24]	2004	×				×		×	×	×			×
BV-BRC	Bacterial and Viral Bioinformatics Resource Center for bacterial and viral infectious diseases	https://www.bv-brc.org	[25]	2023	×			×	×			×	×						×		×
PHIDIAS	The Pathogen-Host Interaction Data Integration and Analysis System (PHIDIAS) for pathogens with high priority in public health and biological defense.	https://phidias.us/introduction.php	[26]	2007	×				×			×			×	×				×	×	×
eHOMD	Human Oral Microbiome Database provides a comprehensive curated information on the bacterial species present in the human aerodigestive tract	http://www.homd.org	[27]	2010	×				×	×		×		×	×
BacDive	The Bacterial Diversity Metadatabase	https://bacdive.dsmz.de/advsearch	[9]	2013	×				×	×	×	×	×	×	×	×		×	×
GOLD	Genomes Online Database	https://gold.jgi.doe.gov/organisms	[8]	1997	×	×	×	×	×		×				×	×
GBIF	Global Biodiversity Information Facility. Free and open access to biodiversity data	https://www.gbif.org/species/search	[28]	2001	×	×	×	×		×
ThermoBase	A Centralized Database of Thermophiles and Hyperthermophiles	https://exprotdb.com/organisms_browse.php	[29]	2022	×	×				×	×					×		×
HaloDom	Halophilic organisms across all domains	http://halodom.bio.auth.gr/?view=all_data	[30]	2017	×	×	×											×
BaAMPs	Database dedicated to antimicrobial peptides (AMPs) specifically tested against microbial biofilms.	http://www.baamps.it	[31]	2015	×												×
U.S National Fungus Collections	USDA Agricultural Research Service	https://fungi.ars.usda.gov	[32]	1890		×							×
Microfungi Collection Consortium	The Microfungi Collections Consortium (MiCC) is a collaborative effort to digitize specimen label data from North American microfungi	https://www.microfungi.org	NA	2014		×				×
Mycology Collection Portal	MyCoPortal is data derived from a network of universities, botanical gardens, museums, and agencies that provide taxonomic, environmental, and specimen-based information	https://www.mycoportal.org/	[33]	2012		×				×
EmsemblFungi	Ensembl Fungi is a browser for fungal genomes.	https://fungi.ensembl.org	[34]	2002		×						×	×		×
UNITE	Database and sequence management environment centered on the eukaryotic nuclear ribosomal ITS region	https://unite.ut.ee/index.php	[35]	2003		×	×						×
FungalTraits	Traits database of fungi and fungus-like stramenopiles	https://docs.google.com/spreadsheets/d/1cxImJWMYVTr6uIQXcTLwK1YNNzQvKJJifzzNpKCM6O0/edit?usp=sharing	[36]	2019		×			×			×
ViruSITE	Integrated database for viral genomics	http://www.virusite.org/index.php?nav=browse	[37]	2016				×					×							×	×
IRAM	Virus Capsids Database & Analysis Resources	NA	[38]	2019				×															×
VIPERdb	Virus Particle Explorer: a web interface for exploring virus structures	https://viperdb.org	[39]	2001				×					×							×	×		×
Virus-Host DB	Organizes data about the relationships between viruses and their hosts	https://www.genome.jp/virushostdb/	[40]	2016				×				×	×							×	×
Diatoms.org	Documents diversity of diatoms in North America, providing information on diatom identification, ecology and distribution	https://diatoms.org/	[10]	2022			×			×

Results and discussion

A centralized and accessible source of information

There are several online biological databases for microbial information (Table 1); however, each of them has limitations. For instance, most databases are dedicated exclusively to a specific group of organisms or characteristics (e.g. a database solely for bacteria or one for human pathogens) [41]. Consequently, there is not a centralized repository for downloading information on microbes from different domains that share a specific characteristic (e.g. all human pathogenic bacteria, fungi, and viruses) or for obtaining data on multiple characteristics across different groups of microorganisms (e.g. extremophile microbes in water microbiomes). Additionally, many databases require coding expertise to retrieve information for more than one organism (e.g. through Application Programming Interface (API) download) [42, 43]. While some databases offer tutorials on how to download this data, these can be time-consuming and may require the installation of additional packages. Furthermore, although most databases are peer-reviewed and published, their online resources may not remain active or maintained, rendering the database temporary and potentially unusable for future research [44, 45]. Lastly, current databases do not facilitate easy retrieval of custom lists of microbial species. This is especially critical when analysing metagenomic results, which can generate lists of hundreds or thousands of species, making manual investigation of each species practically impossible [46]. In contrast, TMD provides a centralized and comprehensive database that is easily accessible, searchable, and downloadable. This resource offers significant support to researchers by delivering a user-friendly platform for the broadscale understanding of microbial species, facilitating the generation of ecological hypotheses, and driving correlations.

To compile and centralize the most relevant information on microbial species, TMD organizes its data by domains (Fig. 1). Each domain is annotated based on its phenotype, such as Gram-staining for bacteria or genome type for viruses, and its ecological features, such as the source of microbiome, host, and extremophiles. Additionally, to address relevance in ecology or biotechnology, algae have been further divided into diatoms and other microalgae [47, 48], and a question regarding the multicellular or unicellular identity of fungi and algae has been added. This last addition recognizes the complexity arising from multicellularity spanning across various lineages [13, 49] and provides an initial list of only-macro- or only-microorganisms for other researchers.

Figure 1.

Structure of The Microbe Directory database. The information collected has been targeted by microbial domains. Data collected followed the schema shown.

Open in new tab Download slide

TMD database is composed of 17,849 curated microbial species from all domains of life, bacteria, archaea, virus and eukarya, including microfungi, and microalgae (Fig. 2). The compilation process is described in the Methods section. Briefly, a list of species names from bacteria, archaea, virus, fungi, and algae were downloaded from NCBI taxonomy and cleaned from redundant names. These species were assigned randomly in triplicates to almost 400 student volunteers. The volunteers were trained and provided with guidance on how to annotate characteristics of the microorganisms assigned before submitting the information into a predesigned online survey. After submitting, the information was compiled and cleaned for typo errors. Answers to questions with a unique answer were submitted to a voting system to identify the most probable answer (Supplementary Fig. 2). For instance, if the same bacterial species was assigned to three different volunteers, and one volunteer submitted it as a Gram-staining negative, but two others submitted it as a Gram-staining positive, the species ultimately was classified as Gram-staining positive based on a majority vote. In cases where the species was assigned only to an even number of volunteers (two or four volunteers), and there was not a consensus vote, the answers were classified as Incongruent. This provides an additional corroboration step and allows us to go back to those problematic species in future versions of the database. Additionally, data generated through Gemini AI (see the Methods section) were compared and merged into the database for further analysis.

Figure 2.

Diagram of The Microbe Directory database. (A) Volunteer recruitment, training, and data annotation. (B) Processing of the data manually collected. (C) Database construction using curated data. (D) Comparison and inclusion of AI-collected data.

Open in new tab Download slide

A user-friendly database interface easy to navigate and open access

The Microbe Directory website dashboard was created using ShinyApps by R. It has a three-pronged use case. Firstly, for exploratory analysis, manually curated datasets are available for search using the ‘Choose a Dataset’ tab. Using this tab, users can filter for particular terms in each column of the database and download the resulting subset as a CSV file. The datasets include an all-encompassing microbe directory (TMD) and subsets specific to bacteria, archaea, algae, fungi, and viruses. The dashboard also includes a ‘Batch Query’ tab, which allows users to submit a list of scientific names of microbes of interest. If included in TMD, annotations corresponding to the query microbes are returned as a table that can be downloaded in CSV format. Furthermore, to help with exploratory analysis, an ‘Exploration’ tab is available. Users can either carry over the results from the ‘Batch Query’ tab or choose one of the pre-curated datasets (i.e. TMD, Bacteria, Archaea, Algae, Fungi, Virus). Users can explore the dataset they choose by selecting an annotation column (e.g. microbiome, extremophile_type, biofilm_forming), which includes a description of the term. After selecting an annotation column to visualize, an interactive plot is rendered in which either the entire set or an intersection between multiple sets can be chosen by clicking on certain highlighted regions on the plot. Through this action, a subset of the dataset is generated based on which set was clicked on. Users can download this subset or the excluded data of the chosen subset using the respective download buttons. Users can also filter within the generated subset table and download the complement dataset.

How to use The Microbe Directory database to aid in metagenomic analyses

The purpose of TMD is not to conduct taxonomic classification during microbiome analysis. Rather, TMD serves as an information database of species, allowing for cross-referencing species names obtained from any taxonomic analysis to aid in the characterization of the taxa. To test the utility of TMD, taxonomic results from four metagenomic datasets were used (Supplementary Table 2).

Using data from the HMP [49], three example questions were investigated: What are the ratios of Gram-staining positive and negative bacteria in each body site, what are the oxygen requirements of bacteria in these sites, and what are the potential sources of viruses present in each body site? By employing TMD to annotate the metagenomic results of HMP data, some physiological trends were found (Fig. 3A). For example, a similar proportion of Gram-negative and Gram-positive bacteria was observed in the gingiva, throat, gastrointestinal tract, and retroauricular crease, although Gram-positive bacteria appeared to be distinctively prevalent in vagina. This result is consistent with metagenomic studies of the vagina microbiome, which have shown that in a healthy state, the Gram-positive bacteria Lactobacillus sp. is one of the most dominating taxa [50]. Similarly, bacteria from vagina, as well as the gastrointestinal tract, differed from other body sites on their bacteria oxygen requirements, with higher abundances of anaerobes, facultative anaerobes, and microaerophiles (Fig. 3B). The genus Lactobacillus, in fact, is also classified as an anaerobe, and previous research supports the higher abundance of anaerobes and microaerophiles in the vagina microbiota [50, 51]. Most of the species present in the HMP dataset are suspected to be part of the human commensal microbiome, including viruses such as herpesviruses, known to be highly widespread in the human population [52]. However, some viruses present in the HMP have other non-human origins, such as animals or plants (Fig. 3C). The highest heterogeneity of virus hosts corresponds to the gingiva and throat, which provides support to the notion that exogenous elements and microbes might be transported to the body through food, water, or air [53].

Figure 3.

Example uses of The Microbe Directory (TMD) with external datasets. (A–C) Whole genome sequencing (WGS) data from the Human Microbiome Project (HMP). (A) Percentage of bacteria by Gram-staining in each body site. (B) Relative abundance of bacteria by their oxygen requirements across body sites. (C) Viral species in each body site and their potential host of origin according to TMD. (E–F) WGS data from MetaSUB project. (E) Correlations of relative abundance of halophile (salt-loving microorganism) and spore-forming bacteria in cities close to the beach. (F) Heatmap of bacteria and their potential host of origin according to TMD. (G–H) WGS from Sponge Microbiome Project. (G) Percentage of bacteria from different microbiomes according to TMD in the three sponge species. (H) Percentage of bacteria and their different types of metabolism present in the sponge microbiome. (I) Relative abundance of pathogenic fungi present in all three sponge species.

Open in new tab Download slide

Another dataset used was the TARA Oceans metagenomes from the polar circle region [18]. Leveraging TMD extremophiles annotations, TMD provided valuable, and possibly novel insights into the presence and diversity of extreme-living microbes (known as extremophiles), in TARA metagenomes. By comparing the taxonomic names retrieved from TARA Oceans microbiome, with the extremophile lists in TMD, 631 taxa were catalogued as extremophiles in TARA, and revealed the presence of 11 distinct types of extremophiles (Fig. 3D), with certain species adapted to more than one form of environmental stress such as tolerant to both acidic (acidophile) and saline (halophile) conditions simultaneously. These microorganisms are described as polyextremophiles. Previous reports have shown the presence of psychrophilic and psychrotolerant microbes in polar environments [54], as well as the presence of acidophiles potentially from exposure to penguin guano and halophiles due to pockets of seawater trapped in closed basins [54]. However, TMD analysis showed a high abundance of thermophiles (microbes known to thrive in high temperatures). Thermophiles have been described in Antarctica in thermal springs, fumaroles, hot soils, and hydrothermal vents [55], but not in the polar region where these samples originate. This analysis depicts the possible research gap on high-temperature microbes in polar regions, and the potential novel molecular traits that allow them to thrive in these environments. TMD annotation provides new insights into the ecology of microorganisms that might otherwise go unnoticed.

Additionally, the MetaSUB dataset [17] was used to assess TMD annotations of urban microbiomes and investigate the correlation between salt-loving microbes (halophiles) and their prevalence in coastal cities due to their proximity to high-salinity waters. By employing TMD annotations to classify halophilic microbes in both coastal and non-coastal urban areas, our analysis revealed no statistically significant difference in the abundance of halophiles between coastal and non-coastal cities (Fig. 3E). However, a correlation emerged indicating lower levels of spore-forming bacteria in coastal cities. Few studies have accessed the presence of halophiles in city beaches [56, 57]. But no studies have performed a systematic comparison between non-coastal and coastal cities. Additionally, the higher presence of spore-forming microbes in non-coastal cities could be correlated to the season temperature, as previous research has shown an increase in the spore-forming bacteria Bacillus after heatwaves [58]. The analysis with TMD shows that more research into the microbial ecology of cities is needed.

Next, TMD analysis revealed that as expected due to the nature of the MetaSUB dataset, most species found in MetaSUB metagenomes originate from humans. However, there were some additional microbes originating from animals and plants (Fig. 3F). These results are consistent with previous studies describing the correlation between the human skin microbiome and the environment [59], as well as the influences of lifestyle and urban or rural settings on the microbiota of skin, saliva, and gut [60, 61].

An additional analysis using the sponge microbiome dataset taxonomic result [19, 20] and the TMD annotations revealed that the majority of microbial species in sponge metagenomes could be classified as microorganisms coming from water microbiomes (Fig. 3G). A smaller fraction originated from soil or other host microbiomes. This result differs from the original dataset publication where authors showed that, using 16S metabarcoding, microbes in the sponges differ from those in the seawater, with a small subset taxa overlap [19]. This might be because TMD depends heavily on previously published taxa, where microbiomes of hosts such as sponges and corals are underrepresented compared to water or soil microbiomes.

Additionally, microbial communities in sponges exhibited diverse metabolic capabilities according to TMD annotations, with chemo-organo-heterotrophs predominating (Fig. 3H). This observation aligns with previous findings examining metabolic pathways across various Ircinia sponge microbiomes, where genes associated with nitrogen, sulphur, and carbon metabolism were highlighted [20]. Finally, TMD helped to unveil the presence of pathogenic fungi in all three Ircinia sponge species (Fig. 3I). Although the abundance of these fungal species differed among the sponge species, it is noteworthy that marine sponge extracts, particularly from Ircinia, have exhibited significant antifungal activity against various pathogens, including Fusarium oxysporum [62, 63]. This finding implies a potential interaction or long-term colonization of fungal pathogens within these sponge ecosystems that needs to be further investigated.

When assessing the presence of species from each of the four studies in TMD, we found a consistent overlap of ~25% of taxa. This indicates that TMD is phylogenetically diverse and comprises a wide range of microbial species from various environments and hosts (Supplemental Fig. 3). While a 25% species representation may not be sufficient to draw definitive biological conclusions in a metagenomic study, it can still provide valuable insights into the biology and ecology of microbiomes. We anticipate that future versions of TMD will include a greater overlap of species.

Artificial intelligence versus domain of knowledge

Utilizing microbial annotations generated by Gemini AI by Google (See the Methods section), using search prompts, our search yielded 1340 taxa spanning bacteria, archaea, virus, and eukarya. Among these, 285 taxa overlapped with TMD annotations (Fig. 4A), with Archaea exhibiting the highest ratio of overlap (52%) and viruses the lowest (4.7%). Despite this low ratio, virus species showed the highest level of matching results with TMD annotations. Phenotypic traits like type of genome or genome strand demonstrated the highest concordance, whereas ecological features such as virus host exhibited greater disparity (Fig. 4B). These disparities in ecological features stemmed from instances where the AI provided ‘No’ responses for traits categorized as ‘Unknown’ data by TMD. Antagonistic activity, UV resistance, and biofilm formation are among the characteristics catalogued as ‘No’ by AI due to insufficient literature, whereas TMD volunteers recognized the gaps in research and literature and catalogued them as unknown.

Figure 4.

Comparison of Gemini AI and TMD annotations. (A) Overlapped and unique number of taxa in the total database and in each domain by each annotation approach (Gemini AI or TMD). (B) Matrix of annotations by AI compared with TMD in the overlapped taxa. Characteristics that match are depicted in black and discrepancies in white.

Open in new tab Download slide

Recently, AI tools have been developed to help with the laborious literature review process [64, 65]; however they require input libraries of articles to be screened by the AI tools. Given that the annotations by TMD are aimed for thousands of taxa, and multiple microbial characteristics, providing an input library to train AI models would be challenging. Hence, Gemini AI emerged as an alternative method, free to date, for gathering microbial data through prompts. Nevertheless, despite repeated submissions, Gemini consistently provided annotations for the same taxa, thereby restricting the breadth of accessible species through this approach. The limited accessibility of open-access research articles and Gemini's utilization of archives like PubMed might contribute to the limited number of AI microbial annotations, something that can also be a limiting factor for manual annotations if no institutional access is available. Unfortunately, during the revision process of this work, Gemini AI changed its model and no longer allowed access to external sources such as PubMed. Despite these limitations, freely available tools like Gemini could be used to guide manual annotations ultimately supporting the domain of knowledge.

The Microbe Directory builds opportunities between students and research

TMD database is the result of collaborative efforts involving hundreds of students who contributed through manual annotation work. Notably, TMD has offered invaluable learning opportunities amidst the challenges of the COVID-19 pandemic, enabling students to delve into microbiology, scientific literature analysis, and research methodologies. Originally conceived as an in-person project, TMD seamlessly transitioned into a remote collaboration platform, providing continuous training from 2019 to 2022. Faced with restrictions on in-person research activities during the pandemic, many students opted to dedicate additional hours weekly to delve deeper into microbial research, utilizing TMD as a platform for internships and earning research credits. TMD evolved into a foundational project for open research, facilitating remote scientific collaboration and engaging a larger number of students in scientific endeavours.

Conclusion

TMD provides a centralized and curated source of biological and ecological information of microbes from all realms, providing an understanding to the exponential growing dataset generated by sequencing studies. Using external datasets, we show the utility of TMD to understand metagenomic data and to unveil novel trends. However, as any biological database, TMD is far from complete, and authors should interpret results accordingly. As we continue to expand TMD, we hope that in the future more species can be annotated and biological interpretations are more robust.

Moreover, the TMD interface presents a user-friendly visualization of the database, catering to a diverse audience including researchers and students. Additionally, by juxtaposing TMD results with annotations derived from AI, we highlight the limitations that AI encounters in systematic literature review processes compared to the manual curation process of microbial data. Lastly, TMD bridges the gap between research and students by actively involving students from various backgrounds in data collection and analysis.

Acknowledgements

We would like to thank the Epigenomics Core Facility at Weill Cornell Medicine, the Scientific Computing Unit (SCU), XSEDE Supercomputing Resources, and the National Center for Advancing Translational Sciences of the National Institutes of Health. We would also like to thank Marcela Guevara, PhD, and Natalia Vargas, PhD, from University of Los Andes, for their guidance in understanding fungi biology. The following list corresponds to volunteer students who contributed to TMD. The list is provided in alphabetical order: Aakanksha Kuwar, Aamna Arshad, Aasta Jane Tuatis, Abdul Rahmani, Abdullah Junayed, Abid Chowdhury, Abigail Calumpit, Adib Rahim, Adrian David Aguilar, Adriana Kaganovski, Adrita Shafi, Afrin Ansary, Afsana Rahman, Ahmad Shaukat, Ahmed Khan, Aileen Li, Aisha Abid, Aisha Baig, Alan Mancheril, Alba Trepca, Albert Shilman, Alena LeBron, Alexander Forrest, Alexandra Solowinska, Alexis Mitelman, Alia Codelia-Anjum, Aliyah Shari, Aliza Katz, Alleyah Hanson, Amal Elroubat, Amber Arif, Amy Yu, Anastassiya Skripchenko, Andi Rustani, Andrew Zheng, Angel Denisse Desepeda, Aniqa Niha, Aniqua Tultul, Anita Guan, Anna Lin, Ansh Sharma, Ariadna Vanegas-Rubio, Ariel Glazman, Artem Belugin, Ashalee Hinds, Ashley Padilla, Asli Erem, Atquiya Rahman, Avrosina Kamel, Ayah Ahmad, Ayah Nahdi, Ayesha Ranjha, Aysha Tabassum, Bagtygul Doveltova, Balkaran Singh, Beliz Kayis, Binita Riffat, Bishoy Benjamin, Briteny Xu, Brittney Paniagua, Bruno Spinetti Moda, Caolain Mathers, Carmen Spicer, Celine Ramsingh, Charles Wong, Chennai Marcus, Chih Fu Hsiang, Christina Martinez, Christine Kuang, Christine St. Juste, Chukwudi Ojei, Dajana Reci, Daniel Iskhakov, Danying Guan, David Arce, Deepshikha Kewlani, Dhanwantari Chowdhury, Diana Cortez, Diana Sokolov, Diedreanna Welington, Dilara Portelli, Dion Dixie Oubrey, Donghun Han, Doris Chen, Dorothy Liu, Dua Ahmed, Dunia Climaco, Dwayne Seeram, Dylan Gaugh, Edmund Sagastume, Edwin Lin, Eleesia Hernandez, Elena Rahman, Eleonora Achrak, Elizabeth Ghobrial, Elizabeth Katanov, Era Methoxha, Erica Aguayza, Ethan Samuel Lin, Eugene Toth, Fabiha Rahman, Fabliha Hussain, Faiza Ahmed, Faria Haq, Faria Noor, Farida Mutawakil, Fariha Ahmed, Fariha Zinnah, Farjana Rahman, Fatima Jaafar Daher, Fatima Javed, Fatima Mazumder, Fawzia Rahman, Ferduse Rahman, Gazi Husain, Genesis Rodriguez, Grace Han, Guozhang Liang, Hamza Siddiqui, Hamzah Sami, Hardik Bhaskar, Harpreet Kaur, Haya Alkiswani, Hersha McDonald, Hiram Caraballo, Huiyi Wu, Iana Leshchynska, Ibnath Brishty, Ibrahim Rumi, Iftida Faria, Illan Tsalkind, Inuri Galle Liyanage, Isabella Jorge Piccinelli, Ivan Pena, Ivethy De La Rosa, Jahin Samiha, Jake Cohen, Jarair Tahsin, Jarka Faruq, Jason He, Jayati Mondal, Jeanne Jiang, Jeffry Guerrero, Jenny Diaz, Jenny Sitnik, Jiale Chen, Jin Hee Yoo, Jingshao Zhang, John Janosevic, Johnathon Brown, Jonathan Gao, Jonathan Koganov, Joo Sun Mun, Jose Agregado, Jose Leonardo Romero, Joselyn Avila, Julia Andreyeff, Julia Ceban, Junia Sharmin, JunXian (Victor) Ma, Justyna Kadyszewska, Kainat Yasin, Kanza Kahn, Katherine M Cruz, Katherine Michelle Gee, Kathy E Flores, Kayla Leiss, Kaylin Camidge, Kazi Muztabir, Kenyer Malcolm, Kevin Kemelmakher, Kiana Roque, Kimberly Campos, Kimberly Nelsen, Kiowa Gatewood, Kiran Javaid, Korina Kemelmakher, Koshiq Hossain, Laila Gad, Lamisa Rusmeha, Lauritta Gazaryan, Laury Arazi, Lawrence Graham, Leona Kwong, Leya Groysman, Liana Simontobova, Lingjing Wu, Madison Estrella, Mahnoor Memon, Mahum Fatima, Maida Wasim, Maira Urooj, Maira Zaidi, Maisha Uddin, Malika Alamova, Maria Angelica Meza, Mariam Farag, Martyna Kulec, Maryam Rauf, Mayumi Gurung, Melanie Tsyryulnikov, Michael Leyderman, Michael Suarez-Villamil, Michelle Chen, Michelle Ismailov, Michelle Thomas, Mikaila Corriders, Milana Kaziyeva, Miriam Mykho, Mitchell Gonzalez, Mohab Idris, Mohamed Eid, Mohammad Araf, Mohammad Rahman, Moitrayee Dasgupta, Momotaj Jahan, Monique Marie Vilme, Muntaha Munia, Muskan Kaur, Nabila Ibnath, Nadia Rahman, Nafeeza Ali, Nafisa Tanjim Rahman, Naomi Chin, Naomi Paulino Cepeda, Natalie Mendo, Naveed Shah, Nawshin Maleeha, Nekadam Nicole Sezan, Nekadam Sezan, Nelofer Turyali, Neranjan de Silva, Nicholas Khan, Nicole Lopez, Nicole Uzhca, Nisma Hussein, Noa Hacker, Noor Khan, Noor-Ul-Ain Nazir, Oasima Muner, Odinakhon Shamieva, Omina Farukhzoda, Onyekachi Okeke, Ottavia Bourdain, Palakpreet Kaur, Paolo Vallejo-Janeta, Patricia Garcia, Patrick Lilley, Pauline Xu, Pearl Mensah, Quincena Fiffee, Qurrahtulain Ashfaq, Radwa Elshafey, Raichel Philip, Rami Eskelin, Reet Kohli, Reima Begum, Reina Mock Wu, Reya Das, Rida Ahmed, Riham Islam, Rithika Lingala, Roban Shabbir, Robert Novo, Romana Ahmed, Ryan Feng, Sabab Al-Din, Sabiha Tasnim, Sahera Fareha, Saima Alam, Saima Rahman, Salma Benitez, Salma Elhassa, Samuel Iskhakov, Sana Abid, Sandy Selim, Saneera Kankanala, Sanjana Prasad, Saqifur Rahman, Sara Khasib, Sarah Boladale Johnson, Sarah Shalan, Savita Prasad, Shanjana Babar, Shanna Huang, Shar Zaman, Sharmin Delowar, Sharon Castro, Sheima Begum, Shela Akhtar, Sidorela Reci, Siem Satti, Smeily Rodriguez, Sobia Shahabuddin, Sonia Mim, Soobin Park, Sreevatshan Srinivasan, Stan Lytvynenko, Stanley Chen, Steven Lui, Styliana Maimos, Sumeran Singh, Suraj Patel, Suriya Islam, Syed Ahsan, Syeda Jannath, Sylricka Ruwadi, Tahira Hasan, Tanvir Alam, Tasfiha Tasreen, Tasmia Kabir, Tasmia Promi, Tasnem Kalil, Tasnuva Tarek, Tazpia Taleb, Terricia E Bascombe, Thaif Laskar, Thaira Ahmed, Tiffany Hamm, Tina Rozbiani, Tongtong Gu, Vaishnavi Rajesh, Valay Patel, Valentina Gonzalez, Valentina Volpe, Vanessa Arcos, Vasila Abdumuminova, Victor Ma, Vishwashiv Emogaje, Washif Awal, Xiao Wen Chen, Xiomara Quiroz, Yailyn Metz, Yana Glemaud, Yehia Ahmed, Yesid Albarracin, Yoobin Moon, Zachary Boodoo, Zaheen Hossain, Zahin Quayyum, Zainab Hussain, Zainab Javed, Zainab Sami, and Zakia Rahman.

Author contributions

Database structure design was done by M.A.S., H.B., K.R., and C.B. Data collection was performed by M.A.S., K.R., R.E., C.B., and H.B. Data curation and processing were performed by M.A.S., K.R., M.R.A., J.P., R.E. Website interface was built by M.R.A. and M.A.S. Volunteer training was performed by M.A.S., K.R., D.C.D., P.A., S.A.S., M.M.Z., and C.E.M. Manuscript writing and editing were performed by M.A.S., K.R., M.R.A., P.A., S.A.S., M.M.Z., and C.E.M. Metadata collection was performed by The Microbe Directory Consortium.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.C.E.M. is a co-Founder of Biotia and advisor to Mirimus.

Funding

C.E.M. was supported by the National Institutes of Health (U01DA053941, U54AG089334, R01AI151059), NASA (80NSSC24K0728, 80NSSC24K1052), and the WorldQuant Foundation.

Data availability

Data from The Microbe Directory can be found in the database interface: www.themicrobedirectory.com. Raw data from The Microbe Directory domains can be found in GitHub, along with scripts to create main figures, curated lists of species, and Gemini data: https://github.com/mariaasierra/themicrobedirectory. All metagenomic data used for technical validation can be found directly on the studies cited, or using the accession numbers in Supplementary Table 2, and abundance tables in the GitHub.

Annotation tutorials can be found in YouTube (https://www.youtube.com/@themicrobedirectory9883) and microbiology guides on Coda (https://coda.io/d/Microbiology-Guides_dI4sBF6pufK/The-Microbe-Directory_supfr#_luCYc).

References

Lawton

Martinez

Burks

Overview of the LiMB database

Nucleic Acids Res

1989

;

5885

–

10.1093/nar/17.15.5885

Sanger

Coulson

Friedmann

et al.

The nucleotide sequence of bacteriophage phiX174

J Mol Biol

1978

;

125

225

–

10.1016/0022-2836(78)90346-7

Danielewski

Szalata

Nowak

et al.

History of Biological Databases, their importance, and existence in modern scientific and policy context

Genes

2025

;

100

10.3390/genes16010100

Pearson

Söll

The Human Genome Project: a paradigm for information management in the life sciences

FASEB J

1991

;

–

10.1096/fasebj.5.1.1991581

Robert

Stegehuis

Stalpers

The MycoBank Engine and Related Databases

. www.mycobank.org.

2005

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Robertson

Döring

Guralnick

et al.

The GBIF integrated publishing toolkit: facilitating the efficient publishing of biodiversity data on the internet

PLoS One

2014

;

e102623

10.1371/journal.pone.0102623

Escapa

Chen

Huang

et al.

New insights into human nostril microbiome from the expanded Human Oral Microbiome Database (eHOMD): a resource for the microbiome of the human aerodigestive tract

Msystems

2018

;

10.1128/msystems.00187-18

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

Mukherjee

Stamatis

et al.

Twenty-five years of Genomes OnLine Database (GOLD): data updates and new features in v.9

Nucleic Acids Res

2023

;

D957

–

Söhngen

Bunk

Boyke

Podstawka

Adam

et al.

BacDive—the bacterial diversity metadatabase

Nucleic Acids Res

2014

;

D592

–

10.

Spaulding

Potapova

Bishop

et al.

Diatoms.org: supporting taxonomists, connecting communities

Diatom Res

2021

;

291

–

304

10.1080/0269249X.2021.2006790

Google Scholar

Crossref

WorldCat

11.

Pundhir

Mehto

Jaiswal

KoboToolbox, in Open Electronic Data Capture Tools for Medical and Biomedical Research and Medical Allied Professionals

Cambridge, Massachusetts, United States

Academic Press

2024

241

–

329

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

12.

NCBI Taxonomy Database Dump Files

README file of NCBI Taxonomy Database Dump Files

. https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/taxdump_readme.txt.

2020

Accessed version 2020-07-01

13.

Pereira

Macroalgae

Encyclopedia

2021

;

177

–

10.3390/encyclopedia1010017

Google Scholar

Crossref

WorldCat

14.

Sierra

Bhattacharya

Ryon

et al.

The microbe directory v2. 0: an expanded database of ecological and phenotypical features of microbes

bioRxiv

10.1101/2019.12.20.860569

13 December 2019, preprint: not peer reviewed.

OpenURL Placeholder Text

WorldCat

Crossref

15.

Shaaban

Westfall

Mohammad

et al.

The Microbe Directory: an annotated, searchable inventory of microbes’ characteristics

Gates Open Res

2018

;

10.12688/gatesopenres.12772.1

16.

Open AI. ChatGPT. [Large language model]

. https://chat.openai.com/.

Accessed September 2023

17.

Turnbaugh

Ley

Hamady

et al.

The hUman Microbiome Project

Nature

2007

;

449

804

–

18.

Danko

Bezdan

Afshin

et al.

A global metagenomic map of urban microbiomes and antimicrobial resistance

Cell

2021

;

184

3376

–

93.e17

10.1016/j.cell.2021.05.002

19.

Gimmler

Korn

de Vargas

et al.

The Tara Oceans voyage reveals global diversity and distribution patterns of marine planktonic ciliates

Sci Rep

2016

;

33555

20.

Kelly

Carlson

Low

et al.

The relationship between microbiomes and selective regimes in the sponge genus Ircinia

Front Microbiol

2021

;

607289

10.3389/fmicb.2021.607289

21.

Kelly

Carlson

Low

et al.

Novel trends of genome evolution in highly complex tropical sponge microbiomes

Microbiome

2022

;

164

10.1186/s40168-022-01359-z

22.

Stothard

Van Domselaar

Gary

Shrivastava

Savita

et al.

BacMap: an interactive picture atlas of annotated bacterial genomes

Nucleic Acids Res

2005

;

D317

–

23.

Benton

King

Greenfield

et al.

The ATCC genome portal: microbial genome reference standards with data provenance

Microbiol Resour Announc

2021

;

e00818

–

24.

Chen

Yang

et al.

VFDB: a reference database for bacterial virulence factors

Nucleic Acids Res

2005

;

D325

–

25.

Olson

Assaf

Brettin

et al.

Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): a resource combining PATRIC, IRD and ViPR

Nucleic Acids Res

2023

;

D678

–

26.

Xiang

Tian

PHIDIAS: a pathogen-host interaction data integration and analysis system

Genome Biol

2007

;

–

10.1186/gb-2007-8-7-r150

Google Scholar

Crossref

WorldCat

27.

Chen

W-H

Izard

et al.

The Human Oral Microbiome Database: a web accessible resource for investigating oral microbe taxonomic and genomic information

Database

2010

;

2010

baq013

10.1093/database/baq013

28.

Telenius

Biodiversity information goes public: GBIF at your service

Nordic J Bot

2011

;

378

–

10.1111/j.1756-1051.2011.01167.x

Google Scholar

Crossref

WorldCat

29.

DiGiacomo

McKay

Davila

ThermoBase: a database of the phylogeny and physiology of thermophilic and hyperthermophilic organisms

PLoS One

2022

;

e0268253

10.1371/journal.pone.0268253

30.

Loukas

Kappas

Abatzopoulos

HaloDom: a new database of halophiles across all life domains

J Biol Res (Thessalon)

2018

;

–

10.1186/s40709-017-0072-0

31.

Di Luca

Maccari

Maisetta

et al.

BaAMPs: the database of biofilm-active antimicrobial peptides

Biofouling

2015

;

193

–

10.1080/08927014.2015.1021340

32.

Farr

Rossman

Castlebury

United States National Fungus Collections Fungus-Host Dataset

Ag Data Commons

2021

. Fungus-Host-Data_20211105

33.

Miller

Bates

The mycology collections portal (MyCoPortal)

IMA Fungus

2017

;

A65

–

34.

Hubbard

Barker

Birney

et al.

The Ensembl genome database project

Nucleic Acids Res

2002

;

–

35.

Abarenkov

Nilsson

Larsson

K-H

et al.

The UNITE database for molecular identification and taxonomic communication of fungi and other eukaryotes: sequences, taxa and classifications reconsidered

Nucleic Acids Res

2024

;

D791

–

36.

Põlme

Abarenkov

Nilsson

et al.

FungalTraits: a user-friendly traits database of fungi and fungus-like stramenopiles

Fungal Diversity

2020

;

105

–

Google Scholar

Crossref

WorldCat

37.

Stano

Beke

Klucar

viruSITE—integrated database for viral genomics

Database

2016

;

2016

baw162

10.1093/database/baw162

38.

Almansour

Alhagri

Alfares

et al.

IRAM: virus capsid database and analysis resource

Database

2019

;

2019

baz079

10.1093/database/baz079

39.

Montiel-Garcia

Santoyo-Rivera

et al.

VIPERdb v3. 0: a structure-based data analytics platform for viral capsids

Nucleic Acids Res

2021

;

D809

–

40.

Mihara

Nishimura

Shimizu

et al.

Linking virus genomes with host taxonomy

Viruses

2016

;

41.

Zhulin

Databases for microbiologists

J Bacteriol

2015

;

197

2458

–

42.

Disz

Akhter

Cuevas

et al.

Accessing the SEED genome databases via Web services API: tools for programmers

BMC Bioinf

2010

;

–

10.1186/1471-2105-11-319

Google Scholar

Crossref

WorldCat

43.

Deng

Brägelmann

Kryukov

et al.

FirebrowseR: an R client to the Broad Institute's Firehose Pipeline

Database

2017

;

2017

baw160

10.1093/database/baw160

44.

Wren

URL decay in MEDLINE—a 4-year follow-up study

Bioinformatics

2008

;

1381

–

10.1093/bioinformatics/btn127

45.

Wren

Bateman

Databases, Data Tombs and Dust in the Wind

Bioinformatics

, Vol.

Oxford University Press

2008

2127

–

46.

Siddle

Park

et al.

Benchmarking metagenomics tools for taxonomic classification

Cell

2019

;

178

779

–

47.

Serôdio

Lavaud

Diatoms and their ecological importance

. In:

Life Below Water

Cham, Switzerland

Springer

2022

304

–

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

48.

Mourya

Khan

Ahirwar

et al.

Latest trends and developments in microalgae as potential source for biofuels: the case of diatoms

Fuel

2022

;

314

122738

10.1016/j.fuel.2021.122738

Google Scholar

Crossref

WorldCat

49.

Umen

Herron

Green algal models for multicellularity

Annu Rev Genet

2021

;

603

–

10.1146/annurev-genet-032321-091533

50.

Ravel

Gajer

Abdo

et al.

Vaginal microbiome of reproductive-age women

Proc Natl Acad Sci USA

2011

;

108

4680

–

10.1073/pnas.1002611107

51.

Saraf

Sheikh

Ahmad

et al.

Vaginal microbiome: normalcy vs dysbiosis

Arch Microbiol

2021

;

203

3793

–

802

10.1007/s00203-021-02414-3

52.

Zerr

Meier

Selke

et al.

A population-based study of primary human herpesvirus 6 infection

N Engl J Med

2005

;

352

768

–

53.

Sharma

Bhatia

Singh Sodhi

et al.

Oral microbiome and health

AIMS Microbiol

2018

;

10.3934/microbiol.2018.1.42

54.

Hoover

Pikuta

Psychrophilic and Psychrotolerant Microbial Extremophiles in Polar Environments

Florida, United States

CRS Press

2010

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

55.

Flores

Amenábar

Blamey

Hot environments from Antarctica: source of thermophiles and hyperthermophiles, with potential biotechnological applications

. In:

Satyanarayana

Littlechild

Kawarabayasi

(eds),

Thermophilic Microbes in Environmental and Industrial Biotechnology: Biotechnology of Thermophiles.

Dordrecht: Springer

2013

;

–

118

10.1007/978-94-007-5899-5

Google Scholar

OpenURL Placeholder Text

WorldCat

Crossref

56.

Prashad

Ram

Survey of the bacterial diversity at two coastal beaches in Guyana

Biofarmasi J Nat Prod Biochem

2020

;

–

Google Scholar

OpenURL Placeholder Text

WorldCat

57.

Pandiyan

Kushwaha

Bagul

et al.

Halomonas icarae sp. nov., a moderately halophilic bacterium isolated from beach soil in India

Int J Syst Evol Microbiol

2021

;

004611

10.1099/ijsem.0.004611

Google Scholar

Crossref

WorldCat

58.

Fang

Guo

Zhang

et al.

Influence of heat events on the composition of airborne bacterial communities in urban ecosystems

Int J Environ Res Public Health

2018

;

2295

10.3390/ijerph15102295

59.

Callewaert

Ravard Helffer

Lebaron

Skin microbiome and its interplay with the environment

Am J Clin Dermatol

2020

;

–

10.1007/s40257-020-00551-x

60.

Ying

Zeng

D-N

Chi

et al.

The influence of age and gender on skin-associated microbial communities in urban and rural human populations

PLoS One

2015

;

e0141842

10.1371/journal.pone.0141842

61.

David

Materna

Friedman

et al.

Host lifestyle affects human microbiota on daily timescales

Genome Biol

2014

;

–

10.1186/gb-2014-15-7-r89

Google Scholar

Crossref

WorldCat

62.

El Amraoui

El Wahidi

Fassouane

In vitro screening of antifungal activity of marine sponge extracts against five phytopathogenic fungi

Springerplus

2014

;

–

10.1186/2193-1801-3-629

63.

Chen

Aktas

Konuklugil

et al.

A new fusarielin analogue from Penicillium sp. isolated from the Mediterranean sponge Ircinia oros

Tetrahedron Lett

2015

;

5317

–

10.1016/j.tetlet.2015.07.072

Google Scholar

Crossref

WorldCat

64.

van de Schoot

de Bruin

Schram

et al.

An open source machine learning framework for efficient and transparent systematic reviews

Nat Mach Intell

2021

;

125

–

10.1038/s42256-020-00287-7

Google Scholar

Crossref

WorldCat

65.

Gates

Sebastianski

et al.

The semi-automation of title and abstract screening: a retrospective exploration of ways to leverage Abstrackr's relevance predictions in systematic and rapid reviews

BMC Med Res Method

2020

;

–

10.1186/s12874-020-01031-w

Google Scholar

Crossref

WorldCat

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Month:	Total Views:
September 2025	674
October 2025	419
November 2025	153
December 2025	162
January 2026	187
February 2026	140
March 2026	129
April 2026	129
May 2026	72

Article Contents

The Microbe Directory: a centralized database for biological interpretation of microbiome data

Abstract

Introduction

Methods

Volunteers training and data curation

Data compilation and processing

Data generation with Gemini artificial intelligence

External datasets

Results and discussion

A centralized and accessible source of information

A user-friendly database interface easy to navigate and open access

How to use The Microbe Directory database to aid in metagenomic analyses

Artificial intelligence versus domain of knowledge

The Microbe Directory builds opportunities between students and research

Conclusion

Acknowledgements

Author contributions

Conflict of interest

Funding

Data availability

References

Supplementary data

Citations

Views

Altmetric

Citing articles via

Latest

Most Read

Most Cited

Article Contents

The Microbe Directory: a centralized database for biological interpretation of microbiome data Open Access

Abstract

Introduction

Methods

Volunteers training and data curation

Data compilation and processing

Data generation with Gemini artificial intelligence

External datasets

Results and discussion

A centralized and accessible source of information

A user-friendly database interface easy to navigate and open access

How to use The Microbe Directory database to aid in metagenomic analyses

Artificial intelligence versus domain of knowledge

The Microbe Directory builds opportunities between students and research

Conclusion

Acknowledgements

Author contributions

Conflict of interest

Funding

Data availability

References

Supplementary data

Citations

Views

Altmetric

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

Gift article access

Gift article access

Gift article access

Gift article access

The Microbe Directory: a centralized database for biological interpretation of microbiome data