Making sense of bacterial DNA data

Researchers have assembled and characterised more than 660,000 bacterial genomes from publicly available DNA data. Public databases are huge mixed collections of raw data and genomes of widely varying quality, but the new dataset, which is open to the research community, makes it easier to search bacterial genomes for features of interest and examine their evolutionary relationships.

Bacteria are the oldest and most abundant cellular organisms on the planet. They’re incredibly diverse and adaptable, able to survive in almost any environment, from the ocean depths to volcanic springs and even the desert. The human body itself is estimated to contain more bacterial cells than human cells.

‘Tidying up’ the data

Given this impressive diversity, microbiologists trying to understand how bacteria work and evolve have a long road ahead of them. Large volumes of bacterial DNA data are available in open repositories such as EMBL-EBI’s European Nucleotide Archive (ENA). However, many of these datasets are unprocessed, and the remainder have been assembled into genomes using different techniques over the years. Studying them together is a bit like trying to navigate while jumping between your car’s GPS, a paper map and Google Maps - it mostly works, but it will lead you astray precisely when you need it the most.

In an effort to harmonise the data, researchers at EMBL-EBI and the Wellcome Sanger Institute have reviewed all the bacterial datasets available in the ENA and used them to assemble over 660,000 bacterial genomes. Features of interest – such as antimicrobial resistance genes – have been documented, and are now easy to find in the new dataset.

“I study genomic elements that are able to move between different bacteria,” explains Grace Blackwell, Postdoctoral Fellow at EMBL-EBI and the Wellcome Sanger Institute. “To do this, I need to search and analyse as many bacterial genomes as possible. But public data can be quite messy and needs to be processed uniformly, including quality control, before it can be used for analysis. So along with a few colleagues, we decided to ‘tidy up’ the data and make it easier for scientists to ask research questions.”

A gift to the scientific community

Grace and her colleagues spent many months looking through the data, characterising and assembling more than 660,000 bacterial genomes, in the hope that it will help researchers across the globe. They did this for all the bacterial data available in the ENA as of December 2018.

This unique dataset, called COBSI Index, includes three different indices of the data and is now accessible via an FTP site. It integrates a range of search and distance estimates, enabling researchers to check whether a sequence, gene, mutation or plasmid of interest is present in any of the genomes, and to tell how closely related a set of genomes is.

Wider focus needed

While trawling through the data, the researchers were surprised to find that the majority of the data comes from the same 20 species of bacteria. Notably, one third comes from Salmonella enterica, a bacterium that causes foodborne illnesses leading to hospitalisations and deaths worldwide.

“The exercise gave us a detailed overview of the bacteria sequenced over the last 30 years,” explained Zam Iqbal, Group Leader at EMBL-EBI who was also involved in the project. “It confirms that researchers have been focusing on a small number of known pathogens. However, we know that antimicrobial resistance exists in a much wider range of contexts. This narrow sequencing focus is leaving us blind to both AMR genes and their vectors, a host of different mobile elements, which exist in other, less studied species. It shows that we need to widen the range of species we sequence, and to create better mechanisms for sharing the data with the community, so it’s useful to researchers and public health authorities alike.”

Explore further

BLACKWELL, G. A., et al. (2021). Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLOS Biology. Published online 09 November; DOI: 10.1371/journal.pbio.3001421.

Visit the COBS index FTP site.

Figshare repository which includes the characterisation data (including antimicrobial resistance gene data)

Before developing this resource, the Iqbal group produced a web-accessible BIGSI index of more than 400,000 read sets in the ENA from 2016, enabling searches across all data. BIGSI is no longer live. However, this new resource contains a faster and more compact version of BIGSI, called COBS, which, combined with the additional characterisation, assemblies and other indices, considerably increases the functionality.

Image credit: Karen Arnott/EMBL-EBI