Abstract

The Microbe Directory (TMD) is a centralized database of metadata for microbes from all domains that helps with the biological interpretation of metagenomic data. The database comprises phenotypical and ecological traits of microorganisms, which have been verified by independent manual annotations. This effort has been possible by the help of a community of volunteer students worldwide who were trained in manual curation of microbiology data. To summarize this information, we have built an interactive browser that makes the database accessible to everyone, including non-bioinformaticians. We used the TMD data to analyse microbiome samples from different projects such as MetaSUB, TARA Oceans, Human Microbiome Project, and Sponge Microbiome Project, showcasing the utility of TMD. Furthermore, we compare our microbial annotations with annotations collected by artificial intelligence (AI) and demonstrate that despite the high speed of AI in reviewing and collecting microbial data, annotation requires domain knowledge and therefore manual curation. Collectively, TMD provides a unique source of information that can help to interpret microbiome data and uncover biological associations.

Database URL: www.themicrobedirectory.com/

Introduction

The development and use of databases in science is not a recent phenomenon. In recent decades, databases have become a cornerstone of biological research, largely due to advances in molecular techniques [1], which have enabled access to the hundreds to millions of letters that make up a genome.

The first sequenced genome—a bacteriophage—completed by Fredrick Sanger in 1977 [2], was published in its entirety within the manuscript, as no public storage repositories existed at the time to house the 5386-base pair (bp) genome. As a comparison, if the human genome had instead been printed, it would have required more than 1.5 million paper pages to contain the 3.1 billion bp sequence. The sequencing of large genomes, especially the Human Genome Project (HGP), highlighted the need of a centralized method to store sequence information that could be easily accessed, and shareable among research teams worldwide [3, 4].

Today, the scale of sequencing has expanded dramatically, particularly in the field of metagenomics. To date, more than 32,000 research articles describing metagenomes have been published in PubMed. Most of these studies represent newly sequenced metagenomes—a compilation of genomes that likely represent hundreds or thousands of species. Manual investigation of these complex datasets is time-consuming and may lead to a never-ending data curation process. This represents an immediate challenge for researchers attempting to interpret biological data and generate biological hypotheses from sequencing results.

The Microbe Directory (TMD) database was developed as a necessity for a free, user-friendly, and centralized database of microbial features and ecological characteristics. It is designed to be accessible to students, scientists, and clinicians. TMD is not a sequence repository, nor a taxonomic classification tool. Instead, TMD compiles metadata of microbes from all domains of life to aid with the interpretation of metagenomic outputs. By manual annotation and curation, TMD has compiled information to date on almost 18,000 species of bacteria (5201), archaea (873), algae (4209), fungi (3926), and viruses (3,596). The database provides information on 42 microbial features such as the type of genetic material in viruses and the type of Gram-staining for bacteria, as well as ecological characteristics such as the source of the microbiome (e.g. water or soil) or the ability to withstand extreme conditions (i.e. extremophiles). This represents a significant upgrade from the first two versions of the database [5, 6], with a more comprehensive dataset, extended annotations, and a user-friendly data interface that empowers users to navigate and utilize the database effectively.

Lastly, we underscore the importance of manual annotations in TMD, comparing them with annotations generated by artificial intelligence (AI). Despite the recent strides made in AI, our findings demonstrate the continued necessity for a domain of knowledge and, consequently, manual curation in creating and maintaining databases like TMD.

Methods

Volunteers training and data curation

To build the database, manual annotations were performed during the years 2019–22 with the help of almost 400 volunteers. Most of these volunteers were undergraduate students, with some high school and graduate students also participating. The students were recruited primarily through postings on social media and word-of-mouth among students. Volunteers joined the project from various cities in the USA, as well as other countries, including India, Colombia, Ecuador, and Brazil. The selection criteria included only a basic knowledge in microbiology, as the project aimed to provide students with tools and opportunities for remote learning.

All volunteers were trained to read articles, search on repositories, and identify information relevant for the annotation process by Zoom training sessions, video tutorials, and documented guide. This method ensured that all volunteers had an understanding of the research articles, the questions queried in the database, the microbial annotation, and submission process. Each week, the volunteers received a list of 10 microbial species by email and spent on average 8–13 h per week researching the species assigned, and submitting the information queried. Volunteers researched information of microbes from different articles and databases such as PubMed, the American Type Culture Collection (ATCC), MycoBank [5], Global Biodiversity Information Facility (GBIF) [6], eHOMD [7], GOLD [8], BacDive [9], and Diatoms.org [10], along with other openly accessible sources listed in the provided microbiology guides.

To compile information, volunteers submitted their microbial annotations in a customized survey in KoboToolBox [11], Supplementary Fig. 1. Information compiled by each volunteer was later summarized into a single file and cleaned from typos or redundancies using a custom script (see GitHub in the Data availability section).

Data compilation and processing

A list of all microbial species names, including bacteria, archaea, virus, fungi, and algae, were downloaded from NCBI taxonomy in the following link: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/ version 2020-07-01 in a taxdump format. The format contains the full taxonomic structure used by NCBI for classifying organisms. It is widely used in bioinformatics for mapping organism names to their taxonomy, navigating the taxonomic hierarchy, map organism names to NCBI taxon IDs, and normalizing organism names across databases [12].

The species list was screened for redundant names using a list of exclusion words for each microbial group using a custom script available in our GitHub (https://github.com/mariaasierra/themicrobedirectory). The complete list of exclusion terms for each microbial group can be found in Supplementary Table 1.

Eukaryotic microorganisms do not differ taxonomically to multicellular organisms. For example, the microalgae Chlamydomonas sp. and macroalgae, Ulva sp. also known as sea lettuce, are within the same phylum clade Chlorophyta [13]. Therefore, species lists downloaded using the taxdump NCBI format included species of macroorganisms. To compile a list of only microbial eukaryotes, macrofungi and macroalgae were removed from the species list based on a literature review.

To compile the list of microfungi, known genera to comprise only macrofungi (with fruiting body) were removed from the species list retrieved from NCBI taxonomy. All genera of fungi were kept, except Chlorociboria, Paxina, Helvella, Morchella, Kalaharituber, Peziza, Pseudohelotium, Terfezia, Anthracobia, Isaria, Scutellinia, Rhizina, Phillipsia, Tuber, Choiromyces, Cordyceps, Daldinia, Poronia, Xylaria, Penzigia, Agaricus, Arachnion, Battarrea, Battarreoides, Bovista, Calvatia, Chlamydopus, Chlorophyllum, Coniolepiota, Coprinellus, Coprinopsis, Coprinus, Crucibulum, Disciseda, Gyrophragmium, Langermannia, Lepiota, Leucoagaricus, Lycoperdon, Macrolepiota, Montagnea, Montagnites, Mycenastrum, Parasola, Polyplocium, Psalliota, Secotium, Tulostoma, Xanthagaricus, Amanita, Amanita, Saproamanita, Bolbitius, Conocybe, Galeropsis, Pluteolus, Broomeia, Clavaria, Clavaria, Clavulinopsis, Mucronella, Cortinarius, Locellina, Chondrostereum, Cyphella, Claudopus, Clitopilus, Entoloma, Fistulina, Laccaria, Hygrocybe, Hygrophorus, Anellaria, Panaeolina, Panaeolus, Astrosporina, Crepidotus, Inocybe, Phaeoglabrotricha, Phaeosolenia, Lyophyllum, Podabrella, Termitomyces, Calyptella,Marasmius, Solenia, Cruentomycena, Favolaschia, Mycena, Flagelloscypha, Lachnella, Cyathus, Anthracophyllum, Gymnopus, Marasmiellus, Omphalotus, Phellorinia, Armillaria, Armillariella, Cyptotrama, Oudemansiella, Hymenopellis, Physalacria, Xerula, Pleurotus, Pluteus, Volvariella, Podaxis, Ozonium, Psathyrella, Pterula, Schizophyllum, Sebacina, Agrocybe, Deconica, Flammula, Galera, Gymnopilus, Hebeloma, Hymenogaster, Hypholoma, Kuehneromyces, Leratiomyces, Naucoria, Pholiota, Psilocybe, Stropharia, Tubaria, Amparoina, Cellypha, Clitocybe, Collybia, Lepista, Macrocybe, Melanoleuca, Omphalia, Tricholoma, Tricholomopsis, Tricholosporum, Trogia, Auricularia, Eichleriella, Exidia, Heterochaete, Aporpium, Aureoboletus, Boletus, Buchwaldoboletus, Chalciporus, Imleria, Leccinum, Octaviania, Xerocomellus, Phlebopus, Coniophora, Gyrodontium, Gyroporus, Melanogaster, Paxillus, Rhizopogon, Pisolithus, Scleroderma, Serpula, Suillus, Cantharellus, Pellicularia, Clavulina, Corticium, Cytidia, Dendrothele, Laetiporus, Tretopileus, Geasteropsis, Geastrum, Geastrum, Anthurus, Aseroe, Blumenavia, Clathrella, Clathrus, Ileodictyon, Itajahya, Jaczewskia, Kalchbrennera, Kalchbrennera, Mutinus, Phallus, Daedalea, Fomitopsis, Gloeocystidium, Phaeolus, Rhodofomitopsis, Amauroderma, Ganoderma, Gloeophyllum, Ramaria, Clavariadelphus, Coltricia, Fomitoparia, Fuscoporia, Hydnum, Hymenochaete, Phellinus, Polystictus, Trichaptum, Cotylidia, Grandinia, Heterochaete, Oxyporus, Crustodontia, Acia, Aegerita, Bjerkandera, Cymatoderma, Gloeoporus, Irpex, Laschia, Merulius, Mycoleptodon, Odontia, Phlebia, Podoscypha, Pseudolagarobasidium, Abortiporus, Coriolopsis, Coriolus, Daedaleopsis, Favolus, Fomes, Funalia, Grammothele, Heliocybe, Hexagonia, Lentinus, Lenzites, Lenzites, Lenzites, Lignosus, Microporus, Nigroporus, Neolentinus, Panus, Perenniporia, Picipes, Phellinus, Polyporus, Pycnoporus, Trametes, Lentinellus, Dentipellicula, Laxitextum, Asterostroma, Dichostereum, Lachnocladium, Peniophora, Lactarius, Lactifluus, Russula, Aleurodiscus, Stereum, Hypochnus, Thelephora, Arrhytidia, Calocera, Dacrymyces, Dacryopinax, Femsjonia, Naematoloma, Tremella, Phaeotremella, Sirobasidium, Pilobolus, Echinostelium, Cribraria, Dictydiaethalium, Licea, Lycogala, Reticularia, Tubifera, Diachea, Diderma, Didymium, Mucilago, Badhamia, Badhamiopsis, Craterium, Fuligo, Leocarpus, Physarella, Physarum, Willkommlangea, Amaurochaete, Comatricha, Enerthenema, Lamproderma, Stemonaria, Stemonitis, Stemonitopsis, Calomyxa, Arcyria, Hemitrichia, Metatrichia, Oligonema, Perichaena, Trichia, Ceratiomyxa, and C eratium.

To compile only microalgae, all taxa from the divisions Chlorophyta, Streptophyta, Rhodophyta, and Glaucophyta were kept, except Leiosporocerotopsida, Sphagnopsida, Ginkgoopsida, Andreaeobryopsida, Takakiopsida, Gnetopsida, Andreaeopsida, Haplomitriopsida, Pinopsida, Bryopsida, Jungermanniopsida, Magnoliopsida, Charophyceae, Oedipodiopsida, Marchantiopsida, Lycopodiopsida, Polytrichopsida, Polypodiopsida, Anthocerotopsida, Tetraphidopsida, and Cycadopsida.

Additionally, the type of eukaryote macro- or micro- was added to the annotation process done by the volunteers.

To add an extra layer of corroboration to our curation process, we assigned microbial species to, at least, three different randomized volunteers. Responses were compared and a majority-voting system was implemented. Answers where there was not a consensus response were labelled as ‘Incongruent’ (e.g. Gram-staining: positive and negative). In questions where multiple choices were allowed (e.g. microbiome source: soil and urban environments), all answers were kept and unified. See Supplemental Fig. 2 for an example.

Cleaning of the data compiled by volunteers was performed using custom scripts to remove typos in scientific names (see GitHub). Additionally, data from previous versions of TMD database [14, 15] were standardized with the newest microbial features and merged into the current dataset. Cleaned data files used to build the database can be found in GitHub (see the Data availability section).

Data generation with Gemini artificial intelligence

To compare the reliability and expertise of the manual annotation effort versus the speed of AI to annotate microbial characteristics, Gemini was tested using a series of prompts in Google Gemini AI studio v1.5 Flash (https://gemini.google.com/app) using the following link: https://ai.google.dev/aistudio. The prompts were submitted for archaea, bacteria, virus, microfungi, and microalgae. The features that were asked for bacteria and archaea included oxygen use, gram-stain or gram-staining, biofilm formation, spore formation, extremophile, type of extremophile, microbiome, metabolism, and pathogenicity. For virus, genetic material, strand, sense, capsid symmetry, envelope, pathogenicity, and host. For diatoms, type of growth, morphology, biofilm formation, UV resistance, extremophile, type of extremophile, lifestyle, symmetry, toxicity, antagonistic activity, and symbiosis.

The prompts submitted included the following statements: ‘Access PubMed and filter research papers in the field of microbiology that describe [domain] and mention any of the following characteristics: [feature 1, feature 2, feature 3, feature 4, etc.]’. Additionally, some facts were added to the prompts to guide Gemini while compiling the microbial data. The facts included ‘Archaea are not bacteria’ and ‘Gram-staining can be positive or negative’. To view the full list, please refer to Supplementary information. The prompt for each microbial group was submitted at least three times to get as many species entries as possible. No changes between resubmission of the prompts were made. After the third trial ~70% of the species names were duplicated, and the search effort was stopped.

Additionally, the free version of ChatGPT (version GPT-4o) by OpenAI [16] was tested to perform annotations with the prompts previously described for Gemini AI. However, ChatGPT was unable to directly access PubMed or other external databases for research papers. Since the purpose of the comparisons with TMD is to provide free annotations, we did not try the ChatGPT PLUS version.

External datasets

To demonstrate the utility of TMD database, four publicly available whole genome sequencing (WGS) datasets were downloaded: The Human Microbiome Project (HMP) [17], MetaSUB [18], TARA Oceans [19], and Sponge Microbiome Project (SMP) [20, 21]. For HMP, 19 randomized WGS samples were selected, including male and females. These samples comprised different body sites such as gingiva, gastrointestinal tract, right retroauricular crease, throat, and vagina (Supplementary Table 2). From the MetaSUB project, 24 WGS samples were downloaded. These included nine cities in nine countries with at least two different samples per city. Samples were selected semi-randomly to comprise beach and no-beach cities (i.e. beach city = Barcelona, Spain vs. no beach city = Berlin, Germany). From the TARA Oceans dataset, 10 samples from the Polar circle region were randomly downloaded. Lastly, from the SMP, two randomized samples from three sponge species from the genus Ircinia were selected (n = 6 total). Accession numbers for all samples can be found in Supplementary Table 2.

Raw FASTQ files from all datasets were downloaded and processed with the same computational pipeline, including taxonomic classification with Kraken2 with a custom microbial genome database that include all microbial genomes in RefSeq. Statistical analysis and comparisons with the TMD database were performed with custom R scripts, which can be found in GitHub: https://github.com/mariaasierra/themicrobedirectory (see the Data availability section).

Table 1.

Comparison of the Microbe Directory database and other publicly available microbial databases. Some of the characteristics that are included in TMD are individually found in some databases. [8, 9, 22–40]

Reference publicationDatabase publication yearthe microbe directory
DatabaseDescriptionLink databaseBacteria ArchaeaFungiAlgaeVirusGram- Stain (Bacteria)MicrobiomeMetabolismPathogenHostHuman body siteSpore formingOxygen useBiofilm formingExtrem-ophileAnti-microbial ResistanceGenetic material (Virus)Strand (Virus)Envelop (Virus)Capsid (Virus)
BACMAP Genome AtlasBacMap is an interactive visual database containing hundreds of fully labeled, zoomable, and searchable maps of bacterial genomeshttp://bacmap.wishartlab.com[22]2005××××××
ATCCAmerican Type Culture Collection collects, stores, and distributes standard reference microorganismshttps://www.atcc.org/About[23]1927×××××××
VFDBVirulence factors of pathogenic bacteriahttp://www.mgc.ac.cn/cgi-bin/VFs/v5/main.cgi[24]2004××××××
BV-BRCBacterial and Viral Bioinformatics Resource Center for bacterial and viral infectious diseaseshttps://www.bv-brc.org[25]2023×××××××
PHIDIASThe Pathogen-Host Interaction Data Integration and Analysis System (PHIDIAS) for pathogens with high priority in public health and biological defense.https://phidias.us/introduction.php[26]2007××××××××
eHOMDHuman Oral Microbiome Database provides a comprehensive curated information o­n the bacterial species present in the human aerodigestive tracthttp://www.homd.org[27]2010××××××
BacDiveThe Bacterial Diversity Metadatabasehttps://bacdive.dsmz.de/advsearch[9]2013×××××××××××
GOLDGenomes Online Databasehttps://gold.jgi.doe.gov/organisms[8]1997××××××××
GBIFGlobal Biodiversity Information Facility. Free and open access to biodiversity datahttps://www.gbif.org/species/search[28]2001×××××
ThermoBaseA Centralized Database of Thermophiles and Hyperthermophileshttps://exprotdb.com/organisms_browse.php[29]2022××××××
HaloDomHalophilic organisms across all domainshttp://halodom.bio.auth.gr/?view=all_data[30]2017××××
BaAMPsDatabase dedicated to antimicrobial peptides (AMPs) specifically tested against microbial biofilms.http://www.baamps.it[31]2015××
U.S National Fungus CollectionsUSDA Agricultural Research Servicehttps://fungi.ars.usda.gov[32]1890××
Microfungi Collection ConsortiumThe Microfungi Collections Consortium (MiCC) is a collaborative effort to digitize specimen label data from North American microfungihttps://www.microfungi.orgNA2014××
Mycology Collection PortalMyCoPortal is data derived from a network of universities, botanical gardens, museums, and agencies that provide taxonomic, environmental, and specimen-based informationhttps://www.mycoportal.org/[33]2012××
EmsemblFungiEnsembl Fungi is a browser for fungal genomes.https://fungi.ensembl.org[34]2002××××
UNITEDatabase and sequence management environment centered on the eukaryotic nuclear ribosomal ITS regionhttps://unite.ut.ee/index.php[35]2003×××
FungalTraitsTraits database of fungi and fungus-like stramenopileshttps://docs.google.com/spreadsheets/d/1cxImJWMYVTr6uIQXcTLwK1YNNzQvKJJifzzNpKCM6O0/edit?usp=sharing[36]2019×××
ViruSITEIntegrated database for viral genomicshttp://www.virusite.org/index.php?nav=browse[37]2016××××
IRAMVirus Capsids Database & Analysis ResourcesNA[38]2019××
VIPERdbVirus Particle Explorer: a web interface for exploring virus structureshttps://viperdb.org[39]2001×××××
Virus-Host DBOrganizes data about the relationships between viruses and their hostshttps://www.genome.jp/virushostdb/[40]2016×××××
Diatoms.orgDocuments diversity of diatoms in North America, providing information on diatom identification, ecology and distributionhttps://diatoms.org/[10]2022××
Reference publicationDatabase publication yearthe microbe directory
DatabaseDescriptionLink databaseBacteria ArchaeaFungiAlgaeVirusGram- Stain (Bacteria)MicrobiomeMetabolismPathogenHostHuman body siteSpore formingOxygen useBiofilm formingExtrem-ophileAnti-microbial ResistanceGenetic material (Virus)Strand (Virus)Envelop (Virus)Capsid (Virus)
BACMAP Genome AtlasBacMap is an interactive visual database containing hundreds of fully labeled, zoomable, and searchable maps of bacterial genomeshttp://bacmap.wishartlab.com[22]2005××××××
ATCCAmerican Type Culture Collection collects, stores, and distributes standard reference microorganismshttps://www.atcc.org/About[23]1927×××××××
VFDBVirulence factors of pathogenic bacteriahttp://www.mgc.ac.cn/cgi-bin/VFs/v5/main.cgi[24]2004××××××
BV-BRCBacterial and Viral Bioinformatics Resource Center for bacterial and viral infectious diseaseshttps://www.bv-brc.org[25]2023×××××××
PHIDIASThe Pathogen-Host Interaction Data Integration and Analysis System (PHIDIAS) for pathogens with high priority in public health and biological defense.https://phidias.us/introduction.php[26]2007××××××××
eHOMDHuman Oral Microbiome Database provides a comprehensive curated information o­n the bacterial species present in the human aerodigestive tracthttp://www.homd.org[27]2010××××××
BacDiveThe Bacterial Diversity Metadatabasehttps://bacdive.dsmz.de/advsearch[9]2013×××××××××××
GOLDGenomes Online Databasehttps://gold.jgi.doe.gov/organisms[8]1997××××××××
GBIFGlobal Biodiversity Information Facility. Free and open access to biodiversity datahttps://www.gbif.org/species/search[28]2001×××××
ThermoBaseA Centralized Database of Thermophiles and Hyperthermophileshttps://exprotdb.com/organisms_browse.php[29]2022××××××
HaloDomHalophilic organisms across all domainshttp://halodom.bio.auth.gr/?view=all_data[30]2017××××
BaAMPsDatabase dedicated to antimicrobial peptides (AMPs) specifically tested against microbial biofilms.http://www.baamps.it[31]2015××
U.S National Fungus CollectionsUSDA Agricultural Research Servicehttps://fungi.ars.usda.gov[32]1890××
Microfungi Collection ConsortiumThe Microfungi Collections Consortium (MiCC) is a collaborative effort to digitize specimen label data from North American microfungihttps://www.microfungi.orgNA2014××
Mycology Collection PortalMyCoPortal is data derived from a network of universities, botanical gardens, museums, and agencies that provide taxonomic, environmental, and specimen-based informationhttps://www.mycoportal.org/[33]2012××
EmsemblFungiEnsembl Fungi is a browser for fungal genomes.https://fungi.ensembl.org[34]2002××××
UNITEDatabase and sequence management environment centered on the eukaryotic nuclear ribosomal ITS regionhttps://unite.ut.ee/index.php[35]2003×××
FungalTraitsTraits database of fungi and fungus-like stramenopileshttps://docs.google.com/spreadsheets/d/1cxImJWMYVTr6uIQXcTLwK1YNNzQvKJJifzzNpKCM6O0/edit?usp=sharing[36]2019×××
ViruSITEIntegrated database for viral genomicshttp://www.virusite.org/index.php?nav=browse[37]2016××××
IRAMVirus Capsids Database & Analysis ResourcesNA[38]2019××
VIPERdbVirus Particle Explorer: a web interface for exploring virus structureshttps://viperdb.org[39]2001×××××
Virus-Host DBOrganizes data about the relationships between viruses and their hostshttps://www.genome.jp/virushostdb/[40]2016×××××
Diatoms.orgDocuments diversity of diatoms in North America, providing information on diatom identification, ecology and distributionhttps://diatoms.org/[10]2022××
Table 1.

Comparison of the Microbe Directory database and other publicly available microbial databases. Some of the characteristics that are included in TMD are individually found in some databases. [8, 9, 22–40]

Reference publicationDatabase publication yearthe microbe directory
DatabaseDescriptionLink databaseBacteria ArchaeaFungiAlgaeVirusGram- Stain (Bacteria)MicrobiomeMetabolismPathogenHostHuman body siteSpore formingOxygen useBiofilm formingExtrem-ophileAnti-microbial ResistanceGenetic material (Virus)Strand (Virus)Envelop (Virus)Capsid (Virus)
BACMAP Genome AtlasBacMap is an interactive visual database containing hundreds of fully labeled, zoomable, and searchable maps of bacterial genomeshttp://bacmap.wishartlab.com[22]2005××××××
ATCCAmerican Type Culture Collection collects, stores, and distributes standard reference microorganismshttps://www.atcc.org/About[23]1927×××××××
VFDBVirulence factors of pathogenic bacteriahttp://www.mgc.ac.cn/cgi-bin/VFs/v5/main.cgi[24]2004××××××
BV-BRCBacterial and Viral Bioinformatics Resource Center for bacterial and viral infectious diseaseshttps://www.bv-brc.org[25]2023×××××××
PHIDIASThe Pathogen-Host Interaction Data Integration and Analysis System (PHIDIAS) for pathogens with high priority in public health and biological defense.https://phidias.us/introduction.php[26]2007××××××××
eHOMDHuman Oral Microbiome Database provides a comprehensive curated information o­n the bacterial species present in the human aerodigestive tracthttp://www.homd.org[27]2010××××××
BacDiveThe Bacterial Diversity Metadatabasehttps://bacdive.dsmz.de/advsearch[9]2013×××××××××××
GOLDGenomes Online Databasehttps://gold.jgi.doe.gov/organisms[8]1997××××××××
GBIFGlobal Biodiversity Information Facility. Free and open access to biodiversity datahttps://www.gbif.org/species/search[28]2001×××××
ThermoBaseA Centralized Database of Thermophiles and Hyperthermophileshttps://exprotdb.com/organisms_browse.php[29]2022××××××
HaloDomHalophilic organisms across all domainshttp://halodom.bio.auth.gr/?view=all_data[30]2017××××
BaAMPsDatabase dedicated to antimicrobial peptides (AMPs) specifically tested against microbial biofilms.http://www.baamps.it[31]2015××
U.S National Fungus CollectionsUSDA Agricultural Research Servicehttps://fungi.ars.usda.gov[32]1890××
Microfungi Collection ConsortiumThe Microfungi Collections Consortium (MiCC) is a collaborative effort to digitize specimen label data from North American microfungihttps://www.microfungi.orgNA2014××
Mycology Collection PortalMyCoPortal is data derived from a network of universities, botanical gardens, museums, and agencies that provide taxonomic, environmental, and specimen-based informationhttps://www.mycoportal.org/[33]2012××
EmsemblFungiEnsembl Fungi is a browser for fungal genomes.https://fungi.ensembl.org[34]2002××××
UNITEDatabase and sequence management environment centered on the eukaryotic nuclear ribosomal ITS regionhttps://unite.ut.ee/index.php[35]2003×××
FungalTraitsTraits database of fungi and fungus-like stramenopileshttps://docs.google.com/spreadsheets/d/1cxImJWMYVTr6uIQXcTLwK1YNNzQvKJJifzzNpKCM6O0/edit?usp=sharing[36]2019×××
ViruSITEIntegrated database for viral genomicshttp://www.virusite.org/index.php?nav=browse[37]2016××××
IRAMVirus Capsids Database & Analysis ResourcesNA[38]2019××
VIPERdbVirus Particle Explorer: a web interface for exploring virus structureshttps://viperdb.org[39]2001×××××
Virus-Host DBOrganizes data about the relationships between viruses and their hostshttps://www.genome.jp/virushostdb/[40]2016×××××
Diatoms.orgDocuments diversity of diatoms in North America, providing information on diatom identification, ecology and distributionhttps://diatoms.org/[10]2022××
Reference publicationDatabase publication yearthe microbe directory
DatabaseDescriptionLink databaseBacteria ArchaeaFungiAlgaeVirusGram- Stain (Bacteria)MicrobiomeMetabolismPathogenHostHuman body siteSpore formingOxygen useBiofilm formingExtrem-ophileAnti-microbial ResistanceGenetic material (Virus)Strand (Virus)Envelop (Virus)Capsid (Virus)
BACMAP Genome AtlasBacMap is an interactive visual database containing hundreds of fully labeled, zoomable, and searchable maps of bacterial genomeshttp://bacmap.wishartlab.com[22]2005××××××
ATCCAmerican Type Culture Collection collects, stores, and distributes standard reference microorganismshttps://www.atcc.org/About[23]1927×××××××
VFDBVirulence factors of pathogenic bacteriahttp://www.mgc.ac.cn/cgi-bin/VFs/v5/main.cgi[24]2004××××××
BV-BRCBacterial and Viral Bioinformatics Resource Center for bacterial and viral infectious diseaseshttps://www.bv-brc.org[25]2023×××××××
PHIDIASThe Pathogen-Host Interaction Data Integration and Analysis System (PHIDIAS) for pathogens with high priority in public health and biological defense.https://phidias.us/introduction.php[26]2007××××××××
eHOMDHuman Oral Microbiome Database provides a comprehensive curated information o­n the bacterial species present in the human aerodigestive tracthttp://www.homd.org[27]2010××××××
BacDiveThe Bacterial Diversity Metadatabasehttps://bacdive.dsmz.de/advsearch[9]2013×××××××××××
GOLDGenomes Online Databasehttps://gold.jgi.doe.gov/organisms[8]1997××××××××
GBIFGlobal Biodiversity Information Facility. Free and open access to biodiversity datahttps://www.gbif.org/species/search[28]2001×××××
ThermoBaseA Centralized Database of Thermophiles and Hyperthermophileshttps://exprotdb.com/organisms_browse.php[29]2022××××××
HaloDomHalophilic organisms across all domainshttp://halodom.bio.auth.gr/?view=all_data[30]2017××××
BaAMPsDatabase dedicated to antimicrobial peptides (AMPs) specifically tested against microbial biofilms.http://www.baamps.it[31]2015××
U.S National Fungus CollectionsUSDA Agricultural Research Servicehttps://fungi.ars.usda.gov[32]1890××
Microfungi Collection ConsortiumThe Microfungi Collections Consortium (MiCC) is a collaborative effort to digitize specimen label data from North American microfungihttps://www.microfungi.orgNA2014××
Mycology Collection PortalMyCoPortal is data derived from a network of universities, botanical gardens, museums, and agencies that provide taxonomic, environmental, and specimen-based informationhttps://www.mycoportal.org/[33]2012××
EmsemblFungiEnsembl Fungi is a browser for fungal genomes.https://fungi.ensembl.org[34]2002××××
UNITEDatabase and sequence management environment centered on the eukaryotic nuclear ribosomal ITS regionhttps://unite.ut.ee/index.php[35]2003×××
FungalTraitsTraits database of fungi and fungus-like stramenopileshttps://docs.google.com/spreadsheets/d/1cxImJWMYVTr6uIQXcTLwK1YNNzQvKJJifzzNpKCM6O0/edit?usp=sharing[36]2019×××
ViruSITEIntegrated database for viral genomicshttp://www.virusite.org/index.php?nav=browse[37]2016××××
IRAMVirus Capsids Database & Analysis ResourcesNA[38]2019××
VIPERdbVirus Particle Explorer: a web interface for exploring virus structureshttps://viperdb.org[39]2001×××××
Virus-Host DBOrganizes data about the relationships between viruses and their hostshttps://www.genome.jp/virushostdb/[40]2016×××××
Diatoms.orgDocuments diversity of diatoms in North America, providing information on diatom identification, ecology and distributionhttps://diatoms.org/[10]2022××

Results and discussion

A centralized and accessible source of information

There are several online biological databases for microbial information (Table 1); however, each of them has limitations. For instance, most databases are dedicated exclusively to a specific group of organisms or characteristics (e.g. a database solely for bacteria or one for human pathogens) [41]. Consequently, there is not a centralized repository for downloading information on microbes from different domains that share a specific characteristic (e.g. all human pathogenic bacteria, fungi, and viruses) or for obtaining data on multiple characteristics across different groups of microorganisms (e.g. extremophile microbes in water microbiomes). Additionally, many databases require coding expertise to retrieve information for more than one organism (e.g. through Application Programming Interface (API) download) [42, 43]. While some databases offer tutorials on how to download this data, these can be time-consuming and may require the installation of additional packages. Furthermore, although most databases are peer-reviewed and published, their online resources may not remain active or maintained, rendering the database temporary and potentially unusable for future research [44, 45]. Lastly, current databases do not facilitate easy retrieval of custom lists of microbial species. This is especially critical when analysing metagenomic results, which can generate lists of hundreds or thousands of species, making manual investigation of each species practically impossible [46]. In contrast, TMD provides a centralized and comprehensive database that is easily accessible, searchable, and downloadable. This resource offers significant support to researchers by delivering a user-friendly platform for the broadscale understanding of microbial species, facilitating the generation of ecological hypotheses, and driving correlations.

To compile and centralize the most relevant information on microbial species, TMD organizes its data by domains (Fig. 1). Each domain is annotated based on its phenotype, such as Gram-staining for bacteria or genome type for viruses, and its ecological features, such as the source of microbiome, host, and extremophiles. Additionally, to address relevance in ecology or biotechnology, algae have been further divided into diatoms and other microalgae [47, 48], and a question regarding the multicellular or unicellular identity of fungi and algae has been added. This last addition recognizes the complexity arising from multicellularity spanning across various lineages [13, 49] and provides an initial list of only-macro- or only-microorganisms for other researchers.

Structure of The Microbe Directory database. The information collected has been targeted by microbial domains. Data collected followed the schema shown.
Figure 1.

Structure of The Microbe Directory database. The information collected has been targeted by microbial domains. Data collected followed the schema shown.

TMD database is composed of 17,849 curated microbial species from all domains of life, bacteria, archaea, virus and eukarya, including microfungi, and microalgae (Fig. 2). The compilation process is described in the Methods section. Briefly, a list of species names from bacteria, archaea, virus, fungi, and algae were downloaded from NCBI taxonomy and cleaned from redundant names. These species were assigned randomly in triplicates to almost 400 student volunteers. The volunteers were trained and provided with guidance on how to annotate characteristics of the microorganisms assigned before submitting the information into a predesigned online survey. After submitting, the information was compiled and cleaned for typo errors. Answers to questions with a unique answer were submitted to a voting system to identify the most probable answer (Supplementary Fig. 2). For instance, if the same bacterial species was assigned to three different volunteers, and one volunteer submitted it as a Gram-staining negative, but two others submitted it as a Gram-staining positive, the species ultimately was classified as Gram-staining positive based on a majority vote. In cases where the species was assigned only to an even number of volunteers (two or four volunteers), and there was not a consensus vote, the answers were classified as Incongruent. This provides an additional corroboration step and allows us to go back to those problematic species in future versions of the database. Additionally, data generated through Gemini AI (see the Methods section) were compared and merged into the database for further analysis.

Diagram of The Microbe Directory database. (A) Volunteer recruitment, training, and data annotation. (B) Processing of the data manually collected. (C) Database construction using curated data. (D) Comparison and inclusion of AI-collected data.
Figure 2.

Diagram of The Microbe Directory database. (A) Volunteer recruitment, training, and data annotation. (B) Processing of the data manually collected. (C) Database construction using curated data. (D) Comparison and inclusion of AI-collected data.

A user-friendly database interface easy to navigate and open access

The Microbe Directory website dashboard was created using ShinyApps by R. It has a three-pronged use case. Firstly, for exploratory analysis, manually curated datasets are available for search using the ‘Choose a Dataset’ tab. Using this tab, users can filter for particular terms in each column of the database and download the resulting subset as a CSV file. The datasets include an all-encompassing microbe directory (TMD) and subsets specific to bacteria, archaea, algae, fungi, and viruses. The dashboard also includes a ‘Batch Query’ tab, which allows users to submit a list of scientific names of microbes of interest. If included in TMD, annotations corresponding to the query microbes are returned as a table that can be downloaded in CSV format. Furthermore, to help with exploratory analysis, an ‘Exploration’ tab is available. Users can either carry over the results from the ‘Batch Query’ tab or choose one of the pre-curated datasets (i.e. TMD, Bacteria, Archaea, Algae, Fungi, Virus). Users can explore the dataset they choose by selecting an annotation column (e.g. microbiome, extremophile_type, biofilm_forming), which includes a description of the term. After selecting an annotation column to visualize, an interactive plot is rendered in which either the entire set or an intersection between multiple sets can be chosen by clicking on certain highlighted regions on the plot. Through this action, a subset of the dataset is generated based on which set was clicked on. Users can download this subset or the excluded data of the chosen subset using the respective download buttons. Users can also filter within the generated subset table and download the complement dataset.

How to use The Microbe Directory database to aid in metagenomic analyses

The purpose of TMD is not to conduct taxonomic classification during microbiome analysis. Rather, TMD serves as an information database of species, allowing for cross-referencing species names obtained from any taxonomic analysis to aid in the characterization of the taxa. To test the utility of TMD, taxonomic results from four metagenomic datasets were used (Supplementary Table 2).

Using data from the HMP [49], three example questions were investigated: What are the ratios of Gram-staining positive and negative bacteria in each body site, what are the oxygen requirements of bacteria in these sites, and what are the potential sources of viruses present in each body site? By employing TMD to annotate the metagenomic results of HMP data, some physiological trends were found (Fig. 3A). For example, a similar proportion of Gram-negative and Gram-positive bacteria was observed in the gingiva, throat, gastrointestinal tract, and retroauricular crease, although Gram-positive bacteria appeared to be distinctively prevalent in vagina. This result is consistent with metagenomic studies of the vagina microbiome, which have shown that in a healthy state, the Gram-positive bacteria Lactobacillus sp. is one of the most dominating taxa [50]. Similarly, bacteria from vagina, as well as the gastrointestinal tract, differed from other body sites on their bacteria oxygen requirements, with higher abundances of anaerobes, facultative anaerobes, and microaerophiles (Fig. 3B). The genus Lactobacillus, in fact, is also classified as an anaerobe, and previous research supports the higher abundance of anaerobes and microaerophiles in the vagina microbiota [50, 51]. Most of the species present in the HMP dataset are suspected to be part of the human commensal microbiome, including viruses such as herpesviruses, known to be highly widespread in the human population [52]. However, some viruses present in the HMP have other non-human origins, such as animals or plants (Fig. 3C). The highest heterogeneity of virus hosts corresponds to the gingiva and throat, which provides support to the notion that exogenous elements and microbes might be transported to the body through food, water, or air [53].

Example uses of The Microbe Directory (TMD) with external datasets. (A–C) Whole genome sequencing (WGS) data from the Human Microbiome Project (HMP). (A) Percentage of bacteria by Gram-staining in each body site. (B) Relative abundance of bacteria by their oxygen requirements across body sites. (C) Viral species in each body site and their potential host of origin according to TMD. (E–F) WGS data from MetaSUB project. (E) Correlations of relative abundance of halophile (salt-loving microorganism) and spore-forming bacteria in cities close to the beach. (F) Heatmap of bacteria and their potential host of origin according to TMD. (G–H) WGS from Sponge Microbiome Project. (G) Percentage of bacteria from different microbiomes according to TMD in the three sponge species. (H) Percentage of bacteria and their different types of metabolism present in the sponge microbiome. (I) Relative abundance of pathogenic fungi present in all three sponge species.
Figure 3.

Example uses of The Microbe Directory (TMD) with external datasets. (A–C) Whole genome sequencing (WGS) data from the Human Microbiome Project (HMP). (A) Percentage of bacteria by Gram-staining in each body site. (B) Relative abundance of bacteria by their oxygen requirements across body sites. (C) Viral species in each body site and their potential host of origin according to TMD. (E–F) WGS data from MetaSUB project. (E) Correlations of relative abundance of halophile (salt-loving microorganism) and spore-forming bacteria in cities close to the beach. (F) Heatmap of bacteria and their potential host of origin according to TMD. (G–H) WGS from Sponge Microbiome Project. (G) Percentage of bacteria from different microbiomes according to TMD in the three sponge species. (H) Percentage of bacteria and their different types of metabolism present in the sponge microbiome. (I) Relative abundance of pathogenic fungi present in all three sponge species.

Another dataset used was the TARA Oceans metagenomes from the polar circle region [18]. Leveraging TMD extremophiles annotations, TMD provided valuable, and possibly novel insights into the presence and diversity of extreme-living microbes (known as extremophiles), in TARA metagenomes. By comparing the taxonomic names retrieved from TARA Oceans microbiome, with the extremophile lists in TMD, 631 taxa were catalogued as extremophiles in TARA, and revealed the presence of 11 distinct types of extremophiles (Fig. 3D), with certain species adapted to more than one form of environmental stress such as tolerant to both acidic (acidophile) and saline (halophile) conditions simultaneously. These microorganisms are described as polyextremophiles. Previous reports have shown the presence of psychrophilic and psychrotolerant microbes in polar environments [54], as well as the presence of acidophiles potentially from exposure to penguin guano and halophiles due to pockets of seawater trapped in closed basins [54]. However, TMD analysis showed a high abundance of thermophiles (microbes known to thrive in high temperatures). Thermophiles have been described in Antarctica in thermal springs, fumaroles, hot soils, and hydrothermal vents [55], but not in the polar region where these samples originate. This analysis depicts the possible research gap on high-temperature microbes in polar regions, and the potential novel molecular traits that allow them to thrive in these environments. TMD annotation provides new insights into the ecology of microorganisms that might otherwise go unnoticed.

Additionally, the MetaSUB dataset [17] was used to assess TMD annotations of urban microbiomes and investigate the correlation between salt-loving microbes (halophiles) and their prevalence in coastal cities due to their proximity to high-salinity waters. By employing TMD annotations to classify halophilic microbes in both coastal and non-coastal urban areas, our analysis revealed no statistically significant difference in the abundance of halophiles between coastal and non-coastal cities (Fig. 3E). However, a correlation emerged indicating lower levels of spore-forming bacteria in coastal cities. Few studies have accessed the presence of halophiles in city beaches [56, 57]. But no studies have performed a systematic comparison between non-coastal and coastal cities. Additionally, the higher presence of spore-forming microbes in non-coastal cities could be correlated to the season temperature, as previous research has shown an increase in the spore-forming bacteria Bacillus after heatwaves [58]. The analysis with TMD shows that more research into the microbial ecology of cities is needed.

Next, TMD analysis revealed that as expected due to the nature of the MetaSUB dataset, most species found in MetaSUB metagenomes originate from humans. However, there were some additional microbes originating from animals and plants (Fig. 3F). These results are consistent with previous studies describing the correlation between the human skin microbiome and the environment [59], as well as the influences of lifestyle and urban or rural settings on the microbiota of skin, saliva, and gut [60, 61].

An additional analysis using the sponge microbiome dataset taxonomic result [19, 20] and the TMD annotations revealed that the majority of microbial species in sponge metagenomes could be classified as microorganisms coming from water microbiomes (Fig. 3G). A smaller fraction originated from soil or other host microbiomes. This result differs from the original dataset publication where authors showed that, using 16S metabarcoding, microbes in the sponges differ from those in the seawater, with a small subset taxa overlap [19]. This might be because TMD depends heavily on previously published taxa, where microbiomes of hosts such as sponges and corals are underrepresented compared to water or soil microbiomes.

Additionally, microbial communities in sponges exhibited diverse metabolic capabilities according to TMD annotations, with chemo-organo-heterotrophs predominating (Fig. 3H). This observation aligns with previous findings examining metabolic pathways across various Ircinia sponge microbiomes, where genes associated with nitrogen, sulphur, and carbon metabolism were highlighted [20]. Finally, TMD helped to unveil the presence of pathogenic fungi in all three Ircinia sponge species (Fig. 3I). Although the abundance of these fungal species differed among the sponge species, it is noteworthy that marine sponge extracts, particularly from Ircinia, have exhibited significant antifungal activity against various pathogens, including Fusarium oxysporum [62, 63]. This finding implies a potential interaction or long-term colonization of fungal pathogens within these sponge ecosystems that needs to be further investigated.

When assessing the presence of species from each of the four studies in TMD, we found a consistent overlap of ~25% of taxa. This indicates that TMD is phylogenetically diverse and comprises a wide range of microbial species from various environments and hosts (Supplemental Fig. 3). While a 25% species representation may not be sufficient to draw definitive biological conclusions in a metagenomic study, it can still provide valuable insights into the biology and ecology of microbiomes. We anticipate that future versions of TMD will include a greater overlap of species.

Artificial intelligence versus domain of knowledge

Utilizing microbial annotations generated by Gemini AI by Google (See the Methods section), using search prompts, our search yielded 1340 taxa spanning bacteria, archaea, virus, and eukarya. Among these, 285 taxa overlapped with TMD annotations (Fig. 4A), with Archaea exhibiting the highest ratio of overlap (52%) and viruses the lowest (4.7%). Despite this low ratio, virus species showed the highest level of matching results with TMD annotations. Phenotypic traits like type of genome or genome strand demonstrated the highest concordance, whereas ecological features such as virus host exhibited greater disparity (Fig. 4B). These disparities in ecological features stemmed from instances where the AI provided ‘No’ responses for traits categorized as ‘Unknown’ data by TMD. Antagonistic activity, UV resistance, and biofilm formation are among the characteristics catalogued as ‘No’ by AI due to insufficient literature, whereas TMD volunteers recognized the gaps in research and literature and catalogued them as unknown.

Comparison of Gemini AI and TMD annotations. (A) Overlapped and unique number of taxa in the total database and in each domain by each annotation approach (Gemini AI or TMD). (B) Matrix of annotations by AI compared with TMD in the overlapped taxa. Characteristics that match are depicted in black and discrepancies in white.
Figure 4.

Comparison of Gemini AI and TMD annotations. (A) Overlapped and unique number of taxa in the total database and in each domain by each annotation approach (Gemini AI or TMD). (B) Matrix of annotations by AI compared with TMD in the overlapped taxa. Characteristics that match are depicted in black and discrepancies in white.

Recently, AI tools have been developed to help with the laborious literature review process [64, 65]; however they require input libraries of articles to be screened by the AI tools. Given that the annotations by TMD are aimed for thousands of taxa, and multiple microbial characteristics, providing an input library to train AI models would be challenging. Hence, Gemini AI emerged as an alternative method, free to date, for gathering microbial data through prompts. Nevertheless, despite repeated submissions, Gemini consistently provided annotations for the same taxa, thereby restricting the breadth of accessible species through this approach. The limited accessibility of open-access research articles and Gemini's utilization of archives like PubMed might contribute to the limited number of AI microbial annotations, something that can also be a limiting factor for manual annotations if no institutional access is available. Unfortunately, during the revision process of this work, Gemini AI changed its model and no longer allowed access to external sources such as PubMed. Despite these limitations, freely available tools like Gemini could be used to guide manual annotations ultimately supporting the domain of knowledge.

The Microbe Directory builds opportunities between students and research

TMD database is the result of collaborative efforts involving hundreds of students who contributed through manual annotation work. Notably, TMD has offered invaluable learning opportunities amidst the challenges of the COVID-19 pandemic, enabling students to delve into microbiology, scientific literature analysis, and research methodologies. Originally conceived as an in-person project, TMD seamlessly transitioned into a remote collaboration platform, providing continuous training from 2019 to 2022. Faced with restrictions on in-person research activities during the pandemic, many students opted to dedicate additional hours weekly to delve deeper into microbial research, utilizing TMD as a platform for internships and earning research credits. TMD evolved into a foundational project for open research, facilitating remote scientific collaboration and engaging a larger number of students in scientific endeavours.

Conclusion

TMD provides a centralized and curated source of biological and ecological information of microbes from all realms, providing an understanding to the exponential growing dataset generated by sequencing studies. Using external datasets, we show the utility of TMD to understand metagenomic data and to unveil novel trends. However, as any biological database, TMD is far from complete, and authors should interpret results accordingly. As we continue to expand TMD, we hope that in the future more species can be annotated and biological interpretations are more robust.

Moreover, the TMD interface presents a user-friendly visualization of the database, catering to a diverse audience including researchers and students. Additionally, by juxtaposing TMD results with annotations derived from AI, we highlight the limitations that AI encounters in systematic literature review processes compared to the manual curation process of microbial data. Lastly, TMD bridges the gap between research and students by actively involving students from various backgrounds in data collection and analysis.

Acknowledgements

We would like to thank the Epigenomics Core Facility at Weill Cornell Medicine, the Scientific Computing Unit (SCU), XSEDE Supercomputing Resources, and the National Center for Advancing Translational Sciences of the National Institutes of Health. We would also like to thank Marcela Guevara, PhD, and Natalia Vargas, PhD, from University of Los Andes, for their guidance in understanding fungi biology. The following list corresponds to volunteer students who contributed to TMD. The list is provided in alphabetical order: Aakanksha Kuwar, Aamna Arshad, Aasta Jane Tuatis, Abdul Rahmani, Abdullah Junayed, Abid Chowdhury, Abigail Calumpit, Adib Rahim, Adrian David Aguilar, Adriana Kaganovski, Adrita Shafi, Afrin Ansary, Afsana Rahman, Ahmad Shaukat, Ahmed Khan, Aileen Li, Aisha Abid, Aisha Baig, Alan Mancheril, Alba Trepca, Albert Shilman, Alena LeBron, Alexander Forrest, Alexandra Solowinska, Alexis Mitelman, Alia Codelia-Anjum, Aliyah Shari, Aliza Katz, Alleyah Hanson, Amal Elroubat, Amber Arif, Amy Yu, Anastassiya Skripchenko, Andi Rustani, Andrew Zheng, Angel Denisse Desepeda, Aniqa Niha, Aniqua Tultul, Anita Guan, Anna Lin, Ansh Sharma, Ariadna Vanegas-Rubio, Ariel Glazman, Artem Belugin, Ashalee Hinds, Ashley Padilla, Asli Erem, Atquiya Rahman, Avrosina Kamel, Ayah Ahmad, Ayah Nahdi, Ayesha Ranjha, Aysha Tabassum, Bagtygul Doveltova, Balkaran Singh, Beliz Kayis, Binita Riffat, Bishoy Benjamin, Briteny Xu, Brittney Paniagua, Bruno Spinetti Moda, Caolain Mathers, Carmen Spicer, Celine Ramsingh, Charles Wong, Chennai Marcus, Chih Fu Hsiang, Christina Martinez, Christine Kuang, Christine St. Juste, Chukwudi Ojei, Dajana Reci, Daniel Iskhakov, Danying Guan, David Arce, Deepshikha Kewlani, Dhanwantari Chowdhury, Diana Cortez, Diana Sokolov, Diedreanna Welington, Dilara Portelli, Dion Dixie Oubrey, Donghun Han, Doris Chen, Dorothy Liu, Dua Ahmed, Dunia Climaco, Dwayne Seeram, Dylan Gaugh, Edmund Sagastume, Edwin Lin, Eleesia Hernandez, Elena Rahman, Eleonora Achrak, Elizabeth Ghobrial, Elizabeth Katanov, Era Methoxha, Erica Aguayza, Ethan Samuel Lin, Eugene Toth, Fabiha Rahman, Fabliha Hussain, Faiza Ahmed, Faria Haq, Faria Noor, Farida Mutawakil, Fariha Ahmed, Fariha Zinnah, Farjana Rahman, Fatima Jaafar Daher, Fatima Javed, Fatima Mazumder, Fawzia Rahman, Ferduse Rahman, Gazi Husain, Genesis Rodriguez, Grace Han, Guozhang Liang, Hamza Siddiqui, Hamzah Sami, Hardik Bhaskar, Harpreet Kaur, Haya Alkiswani, Hersha McDonald, Hiram Caraballo, Huiyi Wu, Iana Leshchynska, Ibnath Brishty, Ibrahim Rumi, Iftida Faria, Illan Tsalkind, Inuri Galle Liyanage, Isabella Jorge Piccinelli, Ivan Pena, Ivethy De La Rosa, Jahin Samiha, Jake Cohen, Jarair Tahsin, Jarka Faruq, Jason He, Jayati Mondal, Jeanne Jiang, Jeffry Guerrero, Jenny Diaz, Jenny Sitnik, Jiale Chen, Jin Hee Yoo, Jingshao Zhang, John Janosevic, Johnathon Brown, Jonathan Gao, Jonathan Koganov, Joo Sun Mun, Jose Agregado, Jose Leonardo Romero, Joselyn Avila, Julia Andreyeff, Julia Ceban, Junia Sharmin, JunXian (Victor) Ma, Justyna Kadyszewska, Kainat Yasin, Kanza Kahn, Katherine M Cruz, Katherine Michelle Gee, Kathy E Flores, Kayla Leiss, Kaylin Camidge, Kazi Muztabir, Kenyer Malcolm, Kevin Kemelmakher, Kiana Roque, Kimberly Campos, Kimberly Nelsen, Kiowa Gatewood, Kiran Javaid, Korina Kemelmakher, Koshiq Hossain, Laila Gad, Lamisa Rusmeha, Lauritta Gazaryan, Laury Arazi, Lawrence Graham, Leona Kwong, Leya Groysman, Liana Simontobova, Lingjing Wu, Madison Estrella, Mahnoor Memon, Mahum Fatima, Maida Wasim, Maira Urooj, Maira Zaidi, Maisha Uddin, Malika Alamova, Maria Angelica Meza, Mariam Farag, Martyna Kulec, Maryam Rauf, Mayumi Gurung, Melanie Tsyryulnikov, Michael Leyderman, Michael Suarez-Villamil, Michelle Chen, Michelle Ismailov, Michelle Thomas, Mikaila Corriders, Milana Kaziyeva, Miriam Mykho, Mitchell Gonzalez, Mohab Idris, Mohamed Eid, Mohammad Araf, Mohammad Rahman, Moitrayee Dasgupta, Momotaj Jahan, Monique Marie Vilme, Muntaha Munia, Muskan Kaur, Nabila Ibnath, Nadia Rahman, Nafeeza Ali, Nafisa Tanjim Rahman, Naomi Chin, Naomi Paulino Cepeda, Natalie Mendo, Naveed Shah, Nawshin Maleeha, Nekadam Nicole Sezan, Nekadam Sezan, Nelofer Turyali, Neranjan de Silva, Nicholas Khan, Nicole Lopez, Nicole Uzhca, Nisma Hussein, Noa Hacker, Noor Khan, Noor-Ul-Ain Nazir, Oasima Muner, Odinakhon Shamieva, Omina Farukhzoda, Onyekachi Okeke, Ottavia Bourdain, Palakpreet Kaur, Paolo Vallejo-Janeta, Patricia Garcia, Patrick Lilley, Pauline Xu, Pearl Mensah, Quincena Fiffee, Qurrahtulain Ashfaq, Radwa Elshafey, Raichel Philip, Rami Eskelin, Reet Kohli, Reima Begum, Reina Mock Wu, Reya Das, Rida Ahmed, Riham Islam, Rithika Lingala, Roban Shabbir, Robert Novo, Romana Ahmed, Ryan Feng, Sabab Al-Din, Sabiha Tasnim, Sahera Fareha, Saima Alam, Saima Rahman, Salma Benitez, Salma Elhassa, Samuel Iskhakov, Sana Abid, Sandy Selim, Saneera Kankanala, Sanjana Prasad, Saqifur Rahman, Sara Khasib, Sarah Boladale Johnson, Sarah Shalan, Savita Prasad, Shanjana Babar, Shanna Huang, Shar Zaman, Sharmin Delowar, Sharon Castro, Sheima Begum, Shela Akhtar, Sidorela Reci, Siem Satti, Smeily Rodriguez, Sobia Shahabuddin, Sonia Mim, Soobin Park, Sreevatshan Srinivasan, Stan Lytvynenko, Stanley Chen, Steven Lui, Styliana Maimos, Sumeran Singh, Suraj Patel, Suriya Islam, Syed Ahsan, Syeda Jannath, Sylricka Ruwadi, Tahira Hasan, Tanvir Alam, Tasfiha Tasreen, Tasmia Kabir, Tasmia Promi, Tasnem Kalil, Tasnuva Tarek, Tazpia Taleb, Terricia E Bascombe, Thaif Laskar, Thaira Ahmed, Tiffany Hamm, Tina Rozbiani, Tongtong Gu, Vaishnavi Rajesh, Valay Patel, Valentina Gonzalez, Valentina Volpe, Vanessa Arcos, Vasila Abdumuminova, Victor Ma, Vishwashiv Emogaje, Washif Awal, Xiao Wen Chen, Xiomara Quiroz, Yailyn Metz, Yana Glemaud, Yehia Ahmed, Yesid Albarracin, Yoobin Moon, Zachary Boodoo, Zaheen Hossain, Zahin Quayyum, Zainab Hussain, Zainab Javed, Zainab Sami, and Zakia Rahman.

Author contributions

Database structure design was done by M.A.S., H.B., K.R., and C.B. Data collection was performed by M.A.S., K.R., R.E., C.B., and H.B. Data curation and processing were performed by M.A.S., K.R., M.R.A., J.P., R.E. Website interface was built by M.R.A. and M.A.S. Volunteer training was performed by M.A.S., K.R., D.C.D., P.A., S.A.S., M.M.Z., and C.E.M. Manuscript writing and editing were performed by M.A.S., K.R., M.R.A., P.A., S.A.S., M.M.Z., and C.E.M. Metadata collection was performed by The Microbe Directory Consortium.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.C.E.M. is a co-Founder of Biotia and advisor to Mirimus.

Funding

C.E.M. was supported by the National Institutes of Health (U01DA053941, U54AG089334, R01AI151059), NASA (80NSSC24K0728, 80NSSC24K1052), and the WorldQuant Foundation.

Data availability

Data from The Microbe Directory can be found in the database interface: www.themicrobedirectory.com. Raw data from The Microbe Directory domains can be found in GitHub, along with scripts to create main figures, curated lists of species, and Gemini data: https://github.com/mariaasierra/themicrobedirectory. All metagenomic data used for technical validation can be found directly on the studies cited, or using the accession numbers in Supplementary Table 2, and abundance tables in the GitHub.

Annotation tutorials can be found in YouTube (https://www.youtube.com/@themicrobedirectory9883) and microbiology guides on Coda (https://coda.io/d/Microbiology-Guides_dI4sBF6pufK/The-Microbe-Directory_supfr#_luCYc).

References

1.

Lawton
 
JR
,
Martinez
 
FA
,
Burks
 
C
.
Overview of the LiMB database
.
Nucleic Acids Res
.
1989
;
17
:
5885
89
.

2.

Sanger
 
F
,
Coulson
 
AR
,
Friedmann
 
T
 et al.  
The nucleotide sequence of bacteriophage phiX174
.
J Mol Biol
.
1978
;
125
:
225
46
.

3.

Danielewski
 
M
,
Szalata
 
M
,
Nowak
 
JK
 et al.  
History of Biological Databases, their importance, and existence in modern scientific and policy context
.
Genes
.
2025
;
16
:
100
.

4.

Pearson
 
ML
,
Söll
 
D
.
The Human Genome Project: a paradigm for information management in the life sciences
.
FASEB J
.
1991
;
5
:
35
39
.

5.

Robert
 
V
,
Stegehuis
 
G
,
Stalpers
 
J
.
The MycoBank Engine and Related Databases
. www.mycobank.org.
2005
.

6.

Robertson
 
T
,
Döring
 
M
,
Guralnick
 
R
 et al.  
The GBIF integrated publishing toolkit: facilitating the efficient publishing of biodiversity data on the internet
.
PLoS One
.
2014
;
9
:
e102623
.

7.

Escapa
 
IF
,
Chen
 
T
,
Huang
 
Y
 et al.  
New insights into human nostril microbiome from the expanded Human Oral Microbiome Database (eHOMD): a resource for the microbiome of the human aerodigestive tract
.
Msystems
.
2018
;
3
.

8.

Mukherjee
 
S
,
Stamatis
 
D
,
Li
 
CT
 et al.  
Twenty-five years of Genomes OnLine Database (GOLD): data updates and new features in v.9
.
Nucleic Acids Res
.
2023
;
51
:
D957
63
.

9.

Söhngen
 
C
,
Bunk
 
Boyke
,
Podstawka
 
Adam
 et al.  
BacDive—the bacterial diversity metadatabase
.
Nucleic Acids Res
.
2014
;
42
:
D592
99
.

10.

Spaulding
 
SA
,
Potapova
 
MG
,
Bishop
 
IW
 et al.  
Diatoms.org: supporting taxonomists, connecting communities
.
Diatom Res
.
2021
;
36
:
291
304
.

11.

Pundhir
 
A
,
Mehto
 
AK
,
Jaiswal
 
A
.
KoboToolbox, in Open Electronic Data Capture Tools for Medical and Biomedical Research and Medical Allied Professionals
.
Cambridge, Massachusetts, United States
:
Academic Press
,
2024
,
241
329
.

12.

NCBI Taxonomy Database Dump Files
.
README file of NCBI Taxonomy Database Dump Files
. https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/taxdump_readme.txt.
2020
.
Accessed version 2020-07-01
.

13.

Pereira
 
L
.
Macroalgae
.
Encyclopedia
.
2021
;
1
:
177
88
.

14.

Sierra
 
MA
,
Bhattacharya
 
C
,
Ryon
 
K
 et al.  
The microbe directory v2. 0: an expanded database of ecological and phenotypical features of microbes
.
bioRxiv
, ,
13 December 2019, preprint: not peer reviewed.

15.

Shaaban
 
H
,
Westfall
 
DA
,
Mohammad
 
R
 et al.  
The Microbe Directory: an annotated, searchable inventory of microbes’ characteristics
.
Gates Open Res
.
2018
;
2
:
3
.

16.

Open AI. ChatGPT. [Large language model]
. https://chat.openai.com/.
Accessed September 2023
.

17.

Turnbaugh
 
PJ
,
Ley
 
RE
,
Hamady
 
M
 et al.  
The hUman Microbiome Project
.
Nature
.
2007
;
449
:
804
10
.

18.

Danko
 
D
,
Bezdan
 
D
,
Afshin
 
EE
 et al.  
A global metagenomic map of urban microbiomes and antimicrobial resistance
.
Cell
.
2021
;
184
:
3376
93.e17
.

19.

Gimmler
 
A
,
Korn
 
R
,
de Vargas
 
C
 et al.  
The Tara Oceans voyage reveals global diversity and distribution patterns of marine planktonic ciliates
.
Sci Rep
.
2016
;
6
:
33555
.

20.

Kelly
 
JB
,
Carlson
 
DE
,
Low
 
JS
 et al.  
The relationship between microbiomes and selective regimes in the sponge genus Ircinia
.
Front Microbiol
.
2021
;
12
:
607289
.

21.

Kelly
 
JB
,
Carlson
 
DE
,
Low
 
JS
 et al.  
Novel trends of genome evolution in highly complex tropical sponge microbiomes
.
Microbiome
.
2022
;
10
:
164
.

22.

Stothard
 
P
,
Van Domselaar
 
Gary
,
Shrivastava
 
Savita
 et al.  
BacMap: an interactive picture atlas of annotated bacterial genomes
.
Nucleic Acids Res
.
2005
;
33
:
D317
20
.

23.

Benton
 
B
,
King
 
S
,
Greenfield
 
SR
 et al.  
The ATCC genome portal: microbial genome reference standards with data provenance
.
Microbiol Resour Announc
.
2021
;
10
:
e00818
21
.

24.

Chen
 
L
,
Yang
 
J
,
Yu
 
J
 et al.  
VFDB: a reference database for bacterial virulence factors
.
Nucleic Acids Res
.
2005
;
33
:
D325
28
.

25.

Olson
 
RD
,
Assaf
 
R
,
Brettin
 
T
 et al.  
Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): a resource combining PATRIC, IRD and ViPR
.
Nucleic Acids Res
.
2023
;
51
:
D678
89
.

26.

Xiang
 
Z
,
Tian
 
Y
,
He
 
Y
.
PHIDIAS: a pathogen-host interaction data integration and analysis system
.
Genome Biol
.
2007
;
8
:
1
15
.

27.

Chen
 
T
,
Yu
 
W-H
,
Izard
 
J
 et al.  
The Human Oral Microbiome Database: a web accessible resource for investigating oral microbe taxonomic and genomic information
.
Database
.
2010
;
2010
:
baq013
.

28.

Telenius
 
A
.
Biodiversity information goes public: GBIF at your service
.
Nordic J Bot
.
2011
;
29
:
378
81
.

29.

DiGiacomo
 
J
,
McKay
 
C
,
Davila
 
A
.
ThermoBase: a database of the phylogeny and physiology of thermophilic and hyperthermophilic organisms
.
PLoS One
.
2022
;
17
:
e0268253
.

30.

Loukas
 
A
,
Kappas
 
I
,
Abatzopoulos
 
TJ
.
HaloDom: a new database of halophiles across all life domains
.
J Biol Res (Thessalon)
.
2018
;
25
:
1
8
.

31.

Di Luca
 
M
,
Maccari
 
G
,
Maisetta
 
G
 et al.  
BaAMPs: the database of biofilm-active antimicrobial peptides
.
Biofouling
.
2015
;
31
:
193
99
.

32.

Farr
 
DF
,
Rossman
 
AY
,
Castlebury
 
LA
.
United States National Fungus Collections Fungus-Host Dataset
.
Ag Data Commons
 
2021
. Fungus-Host-Data_20211105

33.

Miller
 
AN
,
Bates
 
ST
.
The mycology collections portal (MyCoPortal)
.
IMA Fungus
.
2017
;
8
:
A65
66
.

34.

Hubbard
 
T
,
Barker
 
D
,
Birney
 
E
 et al.  
The Ensembl genome database project
.
Nucleic Acids Res
.
2002
;
30
:
38
41
.

35.

Abarenkov
 
K
,
Nilsson
 
RH
,
Larsson
 
K-H
 et al.  
The UNITE database for molecular identification and taxonomic communication of fungi and other eukaryotes: sequences, taxa and classifications reconsidered
.
Nucleic Acids Res
.
2024
;
52
:
D791
97
.

36.

Põlme
 
S
,
Abarenkov
 
K
,
Nilsson
 
RH
 et al.  
FungalTraits: a user-friendly traits database of fungi and fungus-like stramenopiles
.
Fungal Diversity
.
2020
;
105
:
1
16
.

37.

Stano
 
M
,
Beke
 
G
,
Klucar
 
L
.
viruSITE—integrated database for viral genomics
.
Database
.
2016
;
2016
:
baw162
.

38.

Almansour
 
I
,
Alhagri
 
M
,
Alfares
 
R
 et al.  
IRAM: virus capsid database and analysis resource
.
Database
.
2019
;
2019
:
baz079
.

39.

Montiel-Garcia
 
D
,
Santoyo-Rivera
 
N
,
Ho
 
P
 et al.  
VIPERdb v3. 0: a structure-based data analytics platform for viral capsids
.
Nucleic Acids Res
.
2021
;
49
:
D809
16
.

40.

Mihara
 
T
,
Nishimura
 
Y
,
Shimizu
 
Y
 et al.  
Linking virus genomes with host taxonomy
.
Viruses
.
2016
;
8
:
66
.

41.

Zhulin
 
IB
.
Databases for microbiologists
.
J Bacteriol
.
2015
;
197
:
2458
67
.

42.

Disz
 
T
,
Akhter
 
S
,
Cuevas
 
D
 et al.  
Accessing the SEED genome databases via Web services API: tools for programmers
.
BMC Bioinf
.
2010
;
11
:
1
11
.

43.

Deng
 
M
,
Brägelmann
 
J
,
Kryukov
 
I
 et al.  
FirebrowseR: an R client to the Broad Institute's Firehose Pipeline
.
Database
.
2017
;
2017
:
baw160
.

44.

Wren
 
JD
.
URL decay in MEDLINE—a 4-year follow-up study
.
Bioinformatics
.
2008
;
24
:
1381
85
.

45.

Wren
 
JD
,
Bateman
 
A
.
Databases, Data Tombs and Dust in the Wind
.
Bioinformatics
, Vol.
24
,
Oxford University Press
,
2008
,
2127
28
.

46.

Ye
 
SH
,
Siddle
 
KJ
,
Park
 
DJ
 et al.  
Benchmarking metagenomics tools for taxonomic classification
.
Cell
.
2019
;
178
:
779
94
.

47.

Serôdio
 
J
,
Lavaud
 
J
.
Diatoms and their ecological importance
. In:
Life Below Water
.
Cham, Switzerland
:
Springer
,
2022
,
304
12
.

48.

Mourya
 
M
,
Khan
 
MJ
,
Ahirwar
 
A
 et al.  
Latest trends and developments in microalgae as potential source for biofuels: the case of diatoms
.
Fuel
.
2022
;
314
:
122738
.

49.

Umen
 
J
,
Herron
 
MD
.
Green algal models for multicellularity
.
Annu Rev Genet
.
2021
;
55
:
603
32
.

50.

Ravel
 
J
,
Gajer
 
P
,
Abdo
 
Z
 et al.  
Vaginal microbiome of reproductive-age women
.
Proc Natl Acad Sci USA
.
2011
;
108
:
4680
87
.

51.

Saraf
 
VS
,
Sheikh
 
SA
,
Ahmad
 
A
 et al.  
Vaginal microbiome: normalcy vs dysbiosis
.
Arch Microbiol
.
2021
;
203
:
3793
802
.

52.

Zerr
 
DM
,
Meier
 
AS
,
Selke
 
SS
 et al.  
A population-based study of primary human herpesvirus 6 infection
.
N Engl J Med
.
2005
;
352
:
768
76
.

53.

Sharma
 
N
,
Bhatia
 
S
,
Singh Sodhi
 
A
 et al.  
Oral microbiome and health
.
AIMS Microbiol
.
2018
;
4
:
42
.

54.

Hoover
 
RB
,
Pikuta
 
EV
.
Psychrophilic and Psychrotolerant Microbial Extremophiles in Polar Environments
.
Florida, United States
:
CRS Press
,
2010
.

55.

Flores
 
PA
,
Amenábar
 
MJ
,
Blamey
 
JM
.
Hot environments from Antarctica: source of thermophiles and hyperthermophiles, with potential biotechnological applications
. In:
Satyanarayana
 
T
,
Littlechild
 
J
,
Kawarabayasi
 
Y
(eds),
Thermophilic Microbes in Environmental and Industrial Biotechnology: Biotechnology of Thermophiles.
 
Dordrecht: Springer
 
2013
;
99
118
.

56.

Prashad
 
L
,
Ram
 
M
.
Survey of the bacterial diversity at two coastal beaches in Guyana
.
Biofarmasi J Nat Prod Biochem
.
2020
;
18
:
57
64
.

57.

Pandiyan
 
K
,
Kushwaha
 
P
,
Bagul
 
SY
 et al.  
Halomonas icarae sp. nov., a moderately halophilic bacterium isolated from beach soil in India
.
Int J Syst Evol Microbiol
.
2021
;
71
:
004611
.

58.

Fang
 
Z
,
Guo
 
W
,
Zhang
 
J
 et al.  
Influence of heat events on the composition of airborne bacterial communities in urban ecosystems
.
Int J Environ Res Public Health
.
2018
;
15
:
2295
.

59.

Callewaert
 
C
,
Ravard Helffer
 
K
,
Lebaron
 
P
.
Skin microbiome and its interplay with the environment
.
Am J Clin Dermatol
.
2020
;
21
:
4
11
.

60.

Ying
 
S
,
Zeng
 
D-N
,
Chi
 
L
 et al.  
The influence of age and gender on skin-associated microbial communities in urban and rural human populations
.
PLoS One
.
2015
;
10
:
e0141842
.

61.

David
 
LA
,
Materna
 
AC
,
Friedman
 
J
 et al.  
Host lifestyle affects human microbiota on daily timescales
.
Genome Biol
.
2014
;
15
:
1
15
.

62.

El Amraoui
 
B
,
El Wahidi
 
M
,
Fassouane
 
A
.
In vitro screening of antifungal activity of marine sponge extracts against five phytopathogenic fungi
.
Springerplus
.
2014
;
3
:
1
5
.

63.

Chen
 
H
,
Aktas
 
N
,
Konuklugil
 
B
 et al.  
A new fusarielin analogue from Penicillium sp. isolated from the Mediterranean sponge Ircinia oros
.
Tetrahedron Lett
.
2015
;
56
:
5317
20
.

64.

van de Schoot
 
R
,
de Bruin
 
J
,
Schram
 
R
 et al.  
An open source machine learning framework for efficient and transparent systematic reviews
.
Nat Mach Intell
.
2021
;
3
:
125
33
.

65.

Gates
 
A
,
Gates
 
M
,
Sebastianski
 
M
 et al.  
The semi-automation of title and abstract screening: a retrospective exploration of ways to leverage Abstrackr's relevance predictions in systematic and rapid reviews
.
BMC Med Res Method
.
2020
;
20
:
1
9
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data