Page Loader
Meta AI creates the largest metagenomic protein database
Meta AI's latest database contains structure predictions for 617 million metagenomic proteins

Meta AI creates the largest metagenomic protein database

Nov 03, 2022
03:10 am

What's the story

Ever wondered what proteins in the ocean, or in your own body were made up of? Meta AI's latest database might have an answer. The ESM Metagenomic Atlas is the first of its kind, comprising more than 600 million metagenomic protein structures. Discovering new metagenomic proteins from this repository might aid in curing diseases, cleaning the environment, and producing cleaner energy.

Context

Why does this story matter?

Structures of billions of new proteins have already been documented in other databases led by NCBI, Joint Genome Institue, and European Bioinformatics Institute. So what's different about Meta AI's new database? The novelty lies in their language model which provides the 'first comprehensive view of the structures of proteins in a metagenomics database at the scale of hundreds of millions of proteins.'

Twitter Post

Take a look at the official announcement

Definition

What is metagenomics?

According to National Human Research Institute, metagenomics is defined as the "study of the structure and function of entire nucleotide sequences isolated and analyzed from all the organisms (typically microbes) in a bulk sample." In other words, it involves the study of a specific community of microorganisms, such as those residing on human skin, in the soil, or in a water sample.

Information

Amino acids are denoted by a specific character

Proteins are complex molecules made up of building blocks called amino acids. Generally, there are 20 different amino acids. Just as how essays contain words, proteins contain a sequence of characters, with each character denoting a specific amino acid.

Development

The language model was developed by experimenting with various proteins

"Using a form of self-supervised learning known as masked language modeling, we trained a language model on the sequences of millions of natural proteins," explains Meta AI. "We trained a language model to fill in the blanks in a protein sequence, like "GL_KKE_AHY_G" across millions of diverse proteins. We found that information about the structure and function of proteins emerges from this training."

Advancements

The latest language model enables structure prediction with high resolution

Evolutionary scale modeling (ESM) utilizes AI to read protein sequences. These language models can pick up the properties of proteins including their structure and function. ESM1b, which was released in 2020, has been employed to predict the evolution of COVID-19 and to determine the genetic causes of diseases. It has been scaled to ESM-2, the next-generation version. This prediction model offers an atomic-scale resolution.

Features

What is unique about the ESM Metagenomic Atlas?

Meta AI claims that ESM Metagenomic Atlas is the first database to provide a comprehensive record of metagenomic proteins. Further, it is the largest known database of protein structures predicted with high resolution. The Atlas is three times larger than existing protein databases. Furthermore, the company's novel protein-folding technique, ESMFold, can make predictions sixty times faster than the current approaches.

Official words

The enormous database might be a significant tool for researchers

"The ESM Metagenomic Atlas will enable scientists to search and analyze the structures of metagenomic proteins at the scale of hundreds of millions of proteins," said Meta AI in its official blog post. "This can help researchers to identify structures that have not been characterized before, search for distant evolutionary relationships, and discover new proteins that can be useful in medicine and other applications."

Information

Their language model works several times faster than existing ones

"This new structure prediction capability enabled us to predict sequences for the more than 600 million metagenomic proteins in the atlas in just two weeks on a cluster of approximately 2,000 GPUs," states Meta AI, crediting the speed of their prediction algorithm.