Curation of a reference database of COI sequences for insect identification through DNA metabarcoding: COins

Reference database efficiency test

DNA-metabarcoding raw data from 54 bulk samples collected with Malaise traps, generated by Kirse et al. (35) using the primer pair mlCOIintF (5′–ACA CTC TTT CCC TAC ACG ACG CTC TTC CGA TCT GGW ACW GGW TGA ACW GTW TAY CCY CC–3′) and dgHCO2198 (5′–GTG ACT GGA GTT CAG ACG TGT GCT CTT CCG ATC TTA AAC TTC AGG GTG ACC AAA RAA YCA–3′), were obtained from the Sequence Read Archive (SRA; project accession number PRJNA68109) and used to test the efficiency of the developed database. The bioinformatic analyses were performed using the QIIME2 platform (24). Raw sequences were denoised with the DADA2 algorithm (36) to remove sequencing errors and infer the amplicon sequence variants (ASVs), i.e. the actual biological sequences.

The taxonomic assignment of the ASVs was then performed using two approaches: (i) BLAST+ local alignment between query and reference sequences (sequence identity = 97%, minimum consensus among top hits = 80% (26)) and (ii) a naïve Bayes taxonomic classifier trained on the reference database using the fit-classifier sklearn method (confidence = 0.97 (37, 38)). Three databases were used as reference: (i) the database developed in this study, hereafter named COins; (ii) MIDORI CO1 unique version 245 (Leray et al., in preparation; http://www.reference-midori.info/download.php#) and (iii) a reference database of COI sequences created with the RESCRIPt software from animal COI sequences registered in BOLD (retrieved in July–August 2020), hereafter named ResBO (database available at https://osf.io/d4jra/).
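The consensus rule of approach (i) can be illustrated with a minimal Python sketch. It is not the QIIME2 implementation: the function name `consensus_taxonomy`, the rank list and the toy hits are assumptions made for illustration, and only the two thresholds (97% identity, 80% consensus) come from the text above.

```python
from collections import Counter

# Hypothetical rank order used only for this illustration.
RANKS = ["class", "order", "family", "genus", "species"]


def consensus_taxonomy(hits, min_identity=0.97, min_consensus=0.80):
    """Return the deepest lineage prefix supported by enough top hits.

    hits: list of (identity, lineage) pairs, where identity is a fraction
    (0-1) and lineage is a tuple of names ordered as in RANKS.
    Assumes min_consensus > 0.5, so at most one name can pass per rank.
    """
    kept = [lineage for identity, lineage in hits if identity >= min_identity]
    if not kept:
        return ()  # no hit passes the identity threshold: unassigned
    assigned = ()
    for depth in range(1, len(RANKS) + 1):
        prefixes = Counter(l[:depth] for l in kept if len(l) >= depth)
        if not prefixes:
            break
        best, count = prefixes.most_common(1)[0]
        if count / len(kept) < min_consensus:
            break  # agreement too low: stop at the previous rank
        assigned = best
    return assigned


# Toy example: four of five retained hits agree down to species level.
balteatus = ("Insecta", "Diptera", "Syrphidae", "Episyrphus", "Episyrphus balteatus")
ribesii = ("Insecta", "Diptera", "Syrphidae", "Syrphus", "Syrphus ribesii")
toy_hits = [(0.99, balteatus)] * 4 + [(0.98, ribesii), (0.90, ("Insecta", "Coleoptera"))]
print(consensus_taxonomy(toy_hits))
```

With a lower level of agreement among the retained hits, the same function stops at the deepest rank on which at least 80% of them agree.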

Results

The database

A total of 5 065 234 insect COI sequences were mined from BOLD. After filtering (up to Step 4; Figure 1), 3 745 421 sequences were lost (mainly due to the removal of sequences lacking species-level identification). At the end of Step 6 (Figure 1), the database was composed of 532 617 unique sequences, belonging to >106 000 species of 27 different insect orders. The most represented order within COins is Lepidoptera, followed by Diptera and Coleoptera (Table 1). Only a few sequences of Zoraptera and Notoptera are present (Table 1).

Table 1.

Number of unique sequences for each insect order included in the database

Order	Number of sequences
Archaeognatha	79
Blattodea	1558
Coleoptera	65 684
Dermaptera	140
Diptera	122 306
Embioptera	69
Ephemeroptera	7150
Hemiptera	28 494
Hymenoptera	58 124
Lepidoptera	209 290
Mantodea	378
Mecoptera	304
Megaloptera	281
Neuroptera	1821
Notoptera	3
Odonata	5142
Orthoptera	7369
Phasmatodea	172
Plecoptera	4733
Psocodea	1800
Raphidioptera	41
Siphonaptera	473
Strepsiptera	56
Thysanoptera	1778
Trichoptera	15 321
Zoraptera	2
Zygentoma	49

Two metadata files associated with COins are available at https://doi.org/10.6084/m9.figshare.19130465.v1. The first reports the identification procedure of the voucher specimens from which the COI sequences included in the database were generated. The same information is also reported for all identical sequences within haplotypes that were removed in Step 5 of the database curation (Figure 1). The second file lists identical sequences that belong to different species within the database. These files can be consulted whenever a specific molecular identification obtained using COins is doubtful.

Database efficiency test

The 54 DNA-metabarcoding samples (35) used to test the database efficiency included a total of 27 348 365 raw reads (mean per sample = 506 451.2 reads); after denoising and filtering, 8312 ASVs were obtained. The two algorithms adopted in this study (BLAST+-based and fit-classifier sklearn) showed good congruence in the taxonomic assignments of the detected ASVs. COins yielded the highest share of ASV identifications common to both algorithms, i.e. 80.6%, compared with 73.6% for MIDORI and 67.8% for ResBO (Figure 2).

Figure 2.

Number of ASVs identified by the two taxonomic assignment algorithms adopted in this study, i.e. the machine learning-based algorithm fit-classifier sklearn (SK L) and the BLAST+ (BL+) algorithm, using each database: (a) MIDORI database, (b) COins database and (c) ResBO database. Numbers of common identifications between the two algorithms are also expressed in percentages.
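How a congruence percentage of this kind can be derived is sketched below in Python. The exact denominator used for Figure 2 is not stated here, so the sketch assumes it is the number of ASVs identified by at least one algorithm; the function name `congruence` and the toy assignments are hypothetical. The mean read depth reported above is recomputed as a simple check.

```python
def congruence(assign_a, assign_b):
    """Percentage of ASVs given the same label by both methods.

    assign_a, assign_b: dicts mapping ASV id -> taxonomic label (str).
    Denominator (an assumption): ASVs labelled by at least one method.
    """
    labelled = set(assign_a) | set(assign_b)
    if not labelled:
        return 0.0
    same = sum(
        1
        for asv in labelled
        if asv in assign_a and assign_a[asv] == assign_b.get(asv)
    )
    return 100.0 * same / len(labelled)


# Toy assignments, not data from the study.
blast = {"asv1": "Apis mellifera", "asv2": "Musca domestica", "asv3": "Diptera"}
sklearn_nb = {"asv1": "Apis mellifera", "asv2": "Musca autumnalis", "asv4": "Insecta"}
print(f"congruence: {congruence(blast, sklearn_nb):.1f}%")

# Mean raw reads per sample, from the totals given in the text.
mean_reads = 27_348_365 / 54
print(f"mean reads per sample: {mean_reads:.1f}")
```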

Using ResBO as reference, 2381 (BLAST+) and 2870 (fit-classifier sklearn) ASVs were assigned to the class Insecta. COins identified 2368 (BLAST+) and 8026 (fit-classifier sklearn) Insecta ASVs, while MIDORI identified 1876 (BLAST+) and 3273 (fit-classifier sklearn). Among them, order-level assignments were obtained for 2374 (BLAST+) and 2008 (fit-classifier sklearn) ASVs using ResBO, 2367 (BLAST+) and 2611 (fit-classifier sklearn) ASVs using COins, and 1864 (BLAST+) and 2219 (fit-classifier sklearn) ASVs using MIDORI (Figure 3). At the species level, ResBO identified 1530 (BLAST+) and 1608 (fit-classifier sklearn) ASVs, COins 2117 (BLAST+) and 2243 (fit-classifier sklearn) ASVs, and MIDORI 1594 (BLAST+) and 1584 (fit-classifier sklearn) ASVs (Figure 3).

Figure 3.

Number of ASVs assigned to the different taxonomic levels (from order to species) when using ResBO, COins and MIDORI as reference. Numbers of assignments obtained using the BLAST+ (BL+) and fit-classifier sklearn (SK L) algorithms are specified too.

Among the ASVs identified to species level with the BLAST+-based algorithm, 825 different species were recognized using MIDORI, 27 of which were shared with ResBO, which identified 887 species (Figure 4a). The highest number of species, 1051, was found using COins, 184 of them in common with ResBO (Figure 4a). With this algorithm, 41.4% of the species were identified identically by the three reference databases (Figure 4a). A similar pattern was observed with the fit-classifier sklearn algorithm: 836 different species were identified using MIDORI, 29 of which were shared with ResBO, which identified 866 species, while COins detected 1108 species, 202 of them in common with ResBO (Figure 4b). With this algorithm, 40.1% of the species were recognized by all three databases (Figure 4b).

Figure 4.

Number of species identified using each database (MIDORI, COins and ResBO). (a) Number of species identified adopting the BLAST+ algorithm (BL+). (b) Number of species identified adopting the fit-classifier sklearn algorithm (SK L). All values are also reported as percentages.
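A three-way comparison of this kind reduces to set operations on the species lists returned with each database. The Python sketch below shows one way to compute the shared and database-specific counts; the function name `overlap_summary`, the choice of the union of all species as the denominator and the toy species lists are assumptions, not the procedure used to draw Figure 4.

```python
def overlap_summary(species_by_db):
    """Summarise species shared by all databases and unique to each.

    species_by_db: dict mapping database name -> iterable of species names.
    Percentages are relative to the union of all species (an assumption).
    """
    sets = {name: set(species) for name, species in species_by_db.items()}
    union = set().union(*sets.values())
    shared = set.intersection(*sets.values())
    unique = {}
    for name, species in sets.items():
        others = set().union(*(s for other, s in sets.items() if other != name))
        unique[name] = len(species - others)
    return {
        "total": len(union),
        "shared_by_all": len(shared),
        "shared_pct": 100.0 * len(shared) / len(union) if union else 0.0,
        "unique": unique,
    }


# Toy species lists, not data from the study.
toy = {
    "MIDORI": ["Apis mellifera", "Musca domestica", "Vespa crabro"],
    "COins": ["Apis mellifera", "Musca domestica", "Pieris rapae", "Lucilia sericata"],
    "ResBO": ["Apis mellifera", "Vespa crabro", "Pieris rapae"],
}
print(overlap_summary(toy))
```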

COins identified some ASVs (<20) as belonging to Rickettsiales; these ASVs were assigned to Insecta or Arthropoda, or remained unassigned, when using the other reference databases.

Discussion

In this study, a reference database of COI sequences (5′ region) for the taxonomic assignment of insects through DNA metabarcoding was developed, starting from the data available on BOLD. These data were filtered according to several criteria to remove sequences that might be sources of error during the taxonomic assignment of ASVs. The different reasons for sequence removal, along with their implications, are discussed below.

Sequences associated with incorrect or invalid taxonomy. The most common situation was the presence of sequences annotated as insect but actually derived from other organisms, in particular Homo sapiens and the most common bacterial endosymbionts of insects (e.g. Wolbachia and Rickettsia). The latter is an already well-known problem of online reference databases (39). Filtering COI sequences separately, as sub-datasets for each insect order, allowed us to detect further inconsistencies between the variability of the sequences and their associated taxonomy. In particular, during the alignment step, some sequences showing low overall similarity to the others in the same sub-dataset were found to be misidentified at the order level. Within the framework of this study, the official validity of all taxonomic names associated with the sequences was intentionally not investigated, because of ongoing debates on the taxonomic status of some insect taxa. Indeed, the increasingly common use of molecular taxonomy has introduced a bias in insect taxonomy: new species are frequently recognized on the basis of molecular information (e.g. through molecular species delimitation or in the context of DNA-barcoding studies) and named, but never, or only much later, formally described. These species names are not considered valid according to the International Code of Zoological Nomenclature (40) until the formal description of the species is published, yet online databases include the reference sequences that allow identification under the new species name. Nonetheless, the filters applied to the sequences, the manual filter in particular, allowed the detection and removal of many invalid species names unrelated to the above-mentioned situations and possibly linked to the absence of species-level morphological identification (e.g. genus names followed by numeric or alphabetic codes, but also geographical or personal names replacing specific epithets). In case of doubt, the publications in which the sequences were generated were consulted.
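Some of the invalid name patterns listed above (codes, placeholders and capitalized words in place of the specific epithet) can be pre-screened automatically before manual inspection. The Python sketch below is only a hedged illustration of such a pre-screen: the regular expression, the placeholder list and the function name are assumptions, it does not reproduce the manual filter applied in this study, and it rejects some legitimate formats (e.g. names that include a subgenus).

```python
import re

# Genus (capitalized) + lowercase epithet, letters and hyphens only.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z][a-z-]+$")

# Hypothetical list of open-nomenclature placeholders.
PLACEHOLDERS = {"sp", "spp", "cf", "aff", "nr", "gen", "indet"}


def looks_like_valid_species(name):
    """Return True if name has the shape of a plain Latin binomial."""
    name = name.strip()
    if not BINOMIAL.match(name):
        return False
    epithet = name.split()[1]
    return epithet not in PLACEHOLDERS


# Hypothetical examples of the name patterns mentioned in the text.
for candidate in ["Apis mellifera", "Chironomus sp1", "Culex Kenya", "Aedes sp."]:
    print(candidate, looks_like_valid_species(candidate))
```

Names that pass such a screen can still be invalid (for instance, undescribed species with a Latin-looking epithet), which is why a manual check remains necessary.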

Non-coding sequences were possibly derived from the amplification of nuclear mitochondrial pseudogenes (numts) (41), from sequencing errors, or from the lack of proper editing of electropherograms before data publication. This issue was particularly evident in the database alignment step, where many sequences were discarded because they introduced gaps of one or two bases in the alignment.
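A complementary way to flag non-coding sequences, different from the alignment-based criterion described above, is to check whether a sequence can be read without stop codons in at least one reading frame. The Python sketch below assumes the invertebrate mitochondrial genetic code (NCBI translation table 5), in which TAA and TAG are the stop codons; the function names and test sequences are hypothetical.

```python
# Stop codons of the invertebrate mitochondrial code (NCBI table 5).
# TGA encodes tryptophan and AGA/AGG encode serine in this code.
STOP_CODONS = frozenset({"TAA", "TAG"})


def stop_codons_in_frame(seq, frame):
    """Count stop codons when reading seq from offset frame (0, 1 or 2)."""
    seq = seq.upper().replace("-", "")  # drop alignment gaps, if any
    return sum(
        1
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS
    )


def is_open_in_some_frame(seq):
    """True if at least one of the three forward frames has no stop codon."""
    return any(stop_codons_in_frame(seq, frame) == 0 for frame in range(3))


# Toy sequences: the first is open in frame 0, the second has a stop
# codon in every forward frame.
print(is_open_in_some_frame("ATGGCATGAACC"))
print(is_open_in_some_frame("TAAATAGATAA"))
```

An insertion or deletion of one or two bases shifts the reading frame and usually introduces stop codons downstream, so sequences failing this check are candidates for closer inspection rather than automatic removal.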

Sequences not associated with species-level taxonomy within a reference database, especially those identified only at the highest taxonomic ranks, appear to reduce the accuracy of molecular identification by hindering assignments at lower taxonomic levels. This is also a likely explanation for some of the results of the present study, i.e. the cases in which COins assigned an ASV to the species level while ResBO assigned the same ASV to a higher taxonomic level, even though the two databases include the same species-level identified reference sequences. At the same time, excluding sequences not identified at the species level from a reference database could increase the number of missing identifications, especially when those sequences are the only representatives of a specific taxon within the database.

Some of the sequences discarded from the database are clearly related to errors and could result from a lack of care by some BOLD users, as is also common in other databases. The BOLD team routinely performs data curation, in particular by checking discordant Barcode Index Numbers and suppressing potentially erroneous sequences from the online database (42). As in this study, the curation is performed manually. It is a time-consuming process carried out periodically, so some erroneous sequences remain in place for a while. This is why the use of publicly available data to develop DNA-metabarcoding reference databases for local use should always include a manual curation step (28).

The efficiency test showed that COins has an identification efficiency comparable to that of the other databases (MIDORI and ResBO) at the highest taxonomic ranks (e.g. order and family), but allows the assignment of a considerably higher number of ASVs to the genus and species levels, with a notable increase of between 25% and 30% in species-level identifications.
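The denominator behind this percentage range is not stated. As a worked check in Python, using the species-level ASV counts reported in the Results, expressing the gain as a share of the COins count gives about 24.7 to 29.4%, whereas expressing it relative to the other database gives about 33 to 42%. The function names are hypothetical and the first reading is only an assumption about how the range was obtained.

```python
# Species-level ASV counts reported in the Results section.
species_level = {
    "BLAST+": {"COins": 2117, "MIDORI": 1594, "ResBO": 1530},
    "sklearn": {"COins": 2243, "MIDORI": 1584, "ResBO": 1608},
}


def gain_as_share_of_coins(counts, other):
    """Extra COins identifications as a percentage of the COins count."""
    return 100.0 * (counts["COins"] - counts[other]) / counts["COins"]


def gain_relative_to_other(counts, other):
    """Extra COins identifications as a percentage of the other database."""
    return 100.0 * (counts["COins"] - counts[other]) / counts[other]


for algorithm, counts in species_level.items():
    for other in ("MIDORI", "ResBO"):
        print(
            algorithm,
            other,
            round(gain_as_share_of_coins(counts, other), 1),
            round(gain_relative_to_other(counts, other), 1),
        )
```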

The analyses also allowed the effect of different assignment algorithms to be observed. The machine learning-based algorithm (fit-classifier sklearn) generally assigned a higher number of ASVs at each taxonomic level than the BLAST+ algorithm (Figure 3). An evident bias of the fit-classifier sklearn algorithm in association with COins is that almost all the ASVs detected in the analysed samples were assigned to Insecta (8026 ASVs out of 8312), even though some of them likely belong to other classes (e.g. sequences that MIDORI and ResBO assigned to Collembola or Arachnida). This is related to the underlying principle of machine learning-based algorithms, which assume that all existing taxa are included in the reference used for the assignment (37, 38). Yet, this drawback is only associated with higher-level taxonomic assignments and does not affect the accuracy of lower-level ones. Indeed, COins was the database for which the highest congruence between the identifications obtained with the two algorithms was achieved (Figure 2).
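The closed-reference assumption can be made concrete with a toy naïve Bayes classifier on k-mer counts, written in plain Python. This is a didactic sketch only: the q2-feature-classifier implementation and its confidence estimate differ, and the k-mer size, add-one smoothing, uniform priors, function names and sequences are all assumptions. The point it shows is that posterior probabilities are normalized over the classes present in the reference, so a query from an absent taxon is still distributed among the known ones.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Return the overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(reference, k=3):
    """reference: dict label -> list of sequences. Add-one smoothing."""
    vocab = {km for seqs in reference.values() for s in seqs for km in kmers(s, k)}
    model = {}
    for label, seqs in reference.items():
        counts = Counter(km for s in seqs for km in kmers(s, k))
        total = sum(counts.values()) + len(vocab) + 1  # +1: unseen k-mers
        log_probs = {km: math.log((counts[km] + 1) / total) for km in vocab}
        model[label] = (log_probs, math.log(1 / total))
    return model


def posterior(model, query, k=3):
    """Posterior over the reference labels only (uniform priors)."""
    scores = {
        label: sum(log_probs.get(km, unseen) for km in kmers(query, k))
        for label, (log_probs, unseen) in model.items()
    }
    top = max(scores.values())
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


# Toy reference with two classes; the second query resembles neither.
model = train({"Insecta": ["ACGTACGTAC"], "Arachnida": ["TTGGCCTTGG"]})
print(posterior(model, "ACGTACGT"))
print(posterior(model, "GAGAGAGAGA"))
```

Even for the unrelated query, the probabilities sum to one across the two reference classes, which is why a reference restricted to insects tends to return insect assignments at the higher ranks.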

The results obtained using COins highlight the importance of manual curation during the development of reference databases for local use. The effort required is, however, undeniable. Unfortunately, fully automated filters that make sequences downloaded from public resources readily usable for metabarcoding taxonomic assignment are not yet available. In the meantime, it is necessary, albeit expensive and time-consuming (especially in terms of updating), to make high-quality data available for those metabarcoding software platforms that use local reference databases. Moreover, direct interaction between software such as QIIME2 and the online BOLD COI database for metazoan ASV/OTU taxonomic assignment would also be advisable.

Funding

The authors acknowledge the support of the Article Processing Charge (APC) central fund of the University of Milan (Italy) and the Department of Agricultural and Environmental Sciences of the University of Milan (Italy) which provided the postdoc fellowship of the first author (years 2020–2022).

Conflict of interest

None declared.

References

1. Hajibabaei, Shokralla, Zhou et al. (2011) Environmental barcoding: a next-generation sequencing approach for biomonitoring applications using river benthos. PLoS One, e17497. doi: 10.1371/journal.pone.0017497

2. Taberlet, Coissac, Pompanon et al. (2012) Towards next-generation biodiversity assessment using DNA metabarcoding. Mol. Ecol., 2045–2050. doi: 10.1111/j.1365-294X.2012.05470.x

3. Staats, Arulandhu, Gravendeel et al. (2016) Advances in DNA metabarcoding for food and wildlife forensic species identification. Anal. Bioanal. Chem., 408, 4615–4630. doi: 10.1007/s00216-016-9595-8

4. Montagna, Berruti, Bianciotto et al. (2018) Differential biodiversity responses between kingdoms (plants, fungi, bacteria and metazoa) along an Alpine succession gradient. Mol. Ecol., 3671–3685.

5. Zhang, Liu, Gao et al. (2020) Tracing the edible and medicinal plant Pueraria montana and its products in the marketplace yields subspecies level distinction using DNA barcoding and DNA metabarcoding. Front. Pharmacol., 336. doi: 10.3389/fphar.2020.00336

6. Brunetti, Magoga, Gionechetti et al. (2021) Does diet breadth affect the complexity of the phytophagous insect microbiota? The case study of Chrysomelidae. Environ. Microbiol. doi: 10.1111/1462-2920.15847

7. deWaard, Mitchell, Keena et al. (2010) Towards a global barcode library for Lymantria (Lepidoptera: Lymantriinae) tussock moths of biosecurity concern. PLoS One, e14280. doi: 10.1371/journal.pone.0014280

8. Marullo, Mercati and Vono (2020) DNA barcoding: a reliable method for the identification of thrips species (Thysanoptera, Thripidae) collected on sticky traps in onion fields. Insects, 489. doi: 10.3390/insects11080489

9. Magoga, Fontaneto and Montagna (2021) Factors affecting the efficiency of molecular species delimitation in a species-rich insect family. Mol. Ecol. Resour., 1475–1489. doi: 10.1111/1755-0998.13352

10. Gadawski, Montagna, Rossaro et al. (2022) DNA barcoding of Chironomidae from the Lake Skadar region: reference library and a comparative analysis of the European fauna. Divers Distrib.

11. Deiner, Bik, Mächler et al. (2017) Environmental DNA metabarcoding: transforming how we survey animal and plant communities. Mol. Ecol., 5872–5895.

12. Batovska, Piper, Valenzuela et al. (2021) Developing a non-destructive metabarcoding protocol for detection of pest insects in bulk trap catches. Sci. Rep., 7946. doi: 10.1038/s41598-021-85855-6

13. Ficetola, Boyer, Valentini et al. (2021) Comparison of markers for the monitoring of freshwater benthic biodiversity through DNA metabarcoding. Mol. Ecol., 3189–3202.

14. Marquina, Andersson A.F. and Ronquist (2018) New mitochondrial primers for metabarcoding of insects, designed and evaluated using in silico methods. Mol. Ecol. Resour., –104. doi: 10.1111/1755-0998.12942

15. Alberdi, Razgour, Aizpurua et al. (2020) DNA metabarcoding and spatial modelling link diet diversification with distribution homogeneity in European bats. Nat. Commun., 1154. doi: 10.1038/s41467-020-14961-2

16. Hardulak L.A., Morinière, Hausmann et al. (2020) DNA metabarcoding for biodiversity monitoring in a national park: screening for invasive and pest species. Mol. Ecol. Resour., 1542–1557. doi: 10.1111/1755-0998.13212

17. Ratnasingham and Hebert P.D.N. (2007) BOLD: the Barcode of Life Data System (www.barcodinglife.org). Mol. Ecol. Notes, 355–364. doi: 10.1111/j.1471-8286.2007.01678.x

18. Clark, Karsch-Mizrachi, Lipman D.J. et al. (2016) GenBank. Nucleic Acids Res., D67–D72.

19. Folmer, Black, Hoeh et al. (1994) DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Mol. Mar. Biol. Biotech., 7881515.

20. Brandon-Mong, Gan, Sing et al. (2015) DNA metabarcoding of insects and allies: an evaluation of primers and pipelines. Bull. Entomol. Res., 105, 717–727. doi: 10.1017/S0007485315000681

21. Elbrecht, Braukmann, Ivanova N.V. et al. (2019) Validation of COI metabarcoding primers for terrestrial arthropods. Peer J., e7745. doi: 10.7717/peerj.7745

22. Ratnasingham (2019) mBRAVE: the multiplex barcode research and visualization environment. BISS, e37986. doi: 10.3897/biss.3.37986

23. Buchner and Leese (2020) BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems. MBMG, e53535. doi: 10.3897/mbmg.4.53535

24. Bolyen, Rideout J.R., Dillon M.R. et al. (2019) Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol., 852–857. doi: 10.1038/s41587-019-0209-9

25. Wang, Garrity, Tiedje et al. (2007) Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol., 5261–5267.

26. Camacho, Coulouris, Avagyan et al. (2009) BLAST+: architecture and applications. BMC Bioinform., 421. doi: 10.1186/1471-2105-10-421

27. Machida, Leray, Ho S.L. et al. (2017) Metazoan mitochondrial gene sequence reference datasets for taxonomic assignment of environmental samples. Sci. Data, 170027. doi: 10.1038/sdata.2017.27

28. Robeson, O’Rourke, Kaehler et al. (2021) RESCRIPt: reproducible sequence taxonomy reference database management. PLoS Comput. Biol., e1009581. doi: 10.1371/journal.pcbi.1009581

29. Beentjes, Speksnijder, Schilthuizen et al. (2019) Increased performance of DNA metabarcoding of macroinvertebrates by taxonomic sorting. PLoS One, e0226527. doi: 10.1371/journal.pone.0226527

30. Altschul S.F., Madden T.L., Schäffer A.A. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 3389–3402. doi: 10.1093/nar/25.17.3389

31. Rice, Longden and Bleasby (2000) EMBOSS: the European molecular biology open software suite. Trends Genet., 276–277. doi: 10.1016/s0168-9525(00)02024-2

32. Katoh and Standley (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 772–780. doi: 10.1093/molbev/mst010

33. Altschul, Gish, Miller et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410. doi: 10.1016/s0022-2836(05)80360-2

34. Brown, Collins, Boyer et al. (2012) SPIDER: an R package for the analysis of species identity and evolution, with particular reference to DNA barcoding. Mol. Ecol. Resour., 562–565. doi: 10.1111/j.1755-0998.2011.03108.x

35. Kirse, Bourlat, Langen et al. (2021) Metabarcoding Malaise traps and soil eDNA reveals seasonal and local arthropod diversity shifts. Sci. Rep., 10498. doi: 10.1038/s41598-021-89950-6

36. Callahan, McMurdie, Rosen et al. (2016) DADA2: high-resolution sample inference from Illumina amplicon data. Nat. Methods, 581–583.

37. Pedregosa, Varoquaux, Gramfort et al. (2011) Scikit-learn: machine learning in Python. J. Mach. Learn. Res., 2825–2830.

38. Bokulich, Kaehler, Rideout et al. (2018) Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin. Microbiome. doi: 10.1186/s40168-018-0470-z

39. Smith, Bertrand, Crosby et al. (2012) Wolbachia and DNA barcoding insects: patterns, potential, and problems. PLoS One, e36514. doi: 10.1371/journal.pone.0036514

40. International Commission on Zoological Nomenclature (1999) International Code of Zoological Nomenclature. 4th edition. International Trust for Zoological Nomenclature, London, UK.

41. Song, Buhay J.E., Whiting M.F. et al. (2008) Many species in one: DNA barcoding overestimates the number of species when nuclear mitochondrial pseudogenes are coamplified. Proc. Natl. Acad. Sci., 105, 13486–13491. doi: 10.1073/pnas.0803076105

42. Coleman C.O. and Radulovici (2020) Challenges for the future of taxonomy: talents, databases and knowledge growth. Megataxa. doi: 10.11646/megataxa.1.1.5