GIFTS also aims to provide a common framework for Ensembl and UniProt data. This infrastructure enables the annotators and curators in both teams to read and comment on data, track information between resources and, as a result, improve the consistency of the mapping and annotation available to users.
A new platform
GIFTS is hosted on an independent website where researchers can access all combined genomic and protein data for the human and mouse genomes. This new website is made possible by the collaborative work of different teams at EMBL-EBI, in particular the Ensembl and UniProt teams, who together helped to automate the GIFTS website, making it more up-to-date, accessible and user-friendly.
“In order to connect the genome and proteome worlds for maximum impact, for example, to enable the mining of genomes for functional variants and disease consequences, it is critical that we are able to map from genome coordinates to the corresponding protein residues,” says Maria Martin, Team Leader at EMBL-EBI.
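To make the idea of mapping from genome coordinates to protein residues concrete, here is a minimal sketch in Python. It is not the GIFTS implementation: the function name `genome_to_residue`, the representation of coding segments as inclusive 1-based `(start, end)` pairs, and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def genome_to_residue(
    pos: int,
    cds_segments: List[Tuple[int, int]],
    strand: int = 1,
) -> Optional[int]:
    """Map a 1-based genomic position to a 1-based protein residue index.

    cds_segments holds the coding parts of each exon as inclusive,
    non-overlapping (start, end) genomic coordinates (hypothetical input
    format). Returns None if the position is not in the coding sequence.
    """
    # Walk the segments in the direction of transcription.
    ordered = sorted(cds_segments, reverse=(strand == -1))
    coding_bases_before = 0
    for start, end in ordered:
        if start <= pos <= end:
            # Offset of the position within this segment, read along the strand.
            within = pos - start if strand == 1 else end - pos
            # Three coding bases per residue; residues are numbered from 1.
            return (coding_bases_before + within) // 3 + 1
        coding_bases_before += end - start + 1
    return None


# Illustrative coordinates only, not taken from any real gene.
segments = [(100, 105), (200, 208)]
print(genome_to_residue(103, segments))             # forward strand
print(genome_to_residue(208, segments, strand=-1))  # reverse strand
```

The sketch ignores details a production mapping has to handle, such as incomplete coding sequences and differences between the reference genome and the protein sequence.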
Integrating genome and protein annotation
“Integrating genome and protein annotation is complex,” says Michele Magrane, Annotation Coordinator in the Protein Function Content Team at EMBL-EBI. “Ensembl focuses on the annotation of transcripts in reference genomes using available cDNA, EST and RNA-seq data, while UniProt focuses on annotating protein sequences using experimental evidence from the literature, homologs in other species and proteomics experiments.”
Because the two resources work from different evidence, their final results may not always match up. To solve this problem, it is important to map the transcripts annotated in Ensembl to the protein isoforms in UniProt. Curators and automatic pipelines at EMBL-EBI have been sharing this information between data resources for many years, and today most of the protein sequences in UniProt align perfectly to translations from Ensembl transcripts.
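The simplest check behind such a mapping, whether a protein sequence matches a transcript's translation exactly, can be sketched as follows. This is an illustrative example only: the function name `classify_mapping`, its category labels and the toy sequences are assumptions, not part of GIFTS, Ensembl or UniProt.

```python
def classify_mapping(uniprot_seq: str, ensembl_translation: str) -> str:
    """Compare a protein sequence with a transcript translation.

    Returns one of three hypothetical labels:
    'identical', 'same_length_mismatch' or 'length_differs'.
    """
    # Normalise case and whitespace; drop a trailing stop-codon marker if present.
    protein = uniprot_seq.strip().upper()
    translation = ensembl_translation.strip().upper().rstrip("*")

    if protein == translation:
        return "identical"
    if len(protein) == len(translation):
        # Same length but at least one residue differs, e.g. a variant position.
        return "same_length_mismatch"
    # Different lengths need a real alignment to interpret.
    return "length_differs"


# Toy sequences for illustration only.
print(classify_mapping("MKTAYIAK", "mktayiak*"))
print(classify_mapping("MKTAYIAK", "MKTAYLAK"))
print(classify_mapping("MKTAYIAK", "MKTAY"))
```

Pairs that fall outside the "identical" category are the ones that would need alignment and, in difficult cases, a curator's judgement.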
“There are unique cases where problems can arise,” says Magrane. “One example is the SAC3D1 protein. UniProt has proteomics evidence that a 404 residue protein exists. Ensembl does not have a transcript for this protein and instead has one for a 358 residue protein. The 404 residue protein aligns to the genome but shows a gap in the alignment. This is because the reference sequence in Ensembl displays a minor allele. UniProt keeps the 404 residue sequence because there is good evidence for it and it is the major allele.”
EMBL-EBI resources have long coordinated the linking of genomic and protein information. GIFTS opens up this work so that researchers can build on it and provide the detailed feedback needed to further improve EMBL-EBI services.
“The GIFTS service will create a bridge between the two broad worlds of nucleotides and amino acids at EMBL-EBI,” says Daniel Zerbino, Team Leader at EMBL-EBI. “Increasingly, researchers are cross-examining genotype data with protein function to better hone in on causal variants, and GIFTS will provide them with a central database to perform these mappings. It will also help enrich our existing data resources such as Ensembl, UniProt, IntAct and PDBe by increasing the consistency between these services.”
GIFTS is just the first step towards linking existing data resources. Making many different types of data available in a single search will make research faster and more convenient for many of the scientists using EMBL-EBI data resources.
Image: Ensembl and UniProt data combine to make GIFTS.
Credit: Spencer Phillips