Improving processing and quality of DNA data for biodiversity research

21 September 2021

A collaboration between EMBL-EBI’s European Nucleotide Archive (ENA) and the Global Biodiversity Information Facility (GBIF) has established automated processes for publishing better-organised, cleaner and more up-to-date datasets on GBIF.

These datasets reuse the globally comprehensive DNA sequence data that ENA and its partners, the National Center for Biotechnology Information (NCBI) and the DNA Data Bank of Japan, maintain in the International Nucleotide Sequence Database Collaboration (INSDC).

EMBL-EBI maintains ENA, which supplied the first DNA-derived dataset shared through GBIF in 2014. As a result of the recent collaboration, these records have been segmented into three different datasets containing sequence-based records, records associated with host organisms and records associated with environment sample identifiers.

An important data feed for biodiversity

"Sequencing is one of the most important data feeds for global biodiversity observation," said Guy Cochrane, head of ENA. "I am delighted that the GBIF and EMBL-EBI ENA teams are working together to extend and enhance the availability of comprehensive INSDC data through GBIF. Our continued work together on improving granularity and filtering of these data will provide an increasingly accurate and reliable body of openly available observations for the scientific community."

Many of the records coming from the EMBL-EBI datasets represent the sequences of specimens held in natural history institutions. Thanks to the clustering algorithm deployed last year and the inclusion of all specimen-related records from EMBL.

Find out more

Read the full announcement on the GBIF website.

This work complements EMBL-EBI and GBIF's earlier efforts to improve the connections between metagenomics and species occurrence data.
