In 1957, a group of scientists gained access to the molecular third dimension. After 22 years of experimentation, John Kendrew of the Cavendish Laboratory in Cambridge, UK, revealed the first protein structure, that of myoglobin, using X-ray diffraction to determine its three-dimensional shape.
Kendrew and Max Perutz shared the 1962 Nobel Prize for their progress in determining protein structures. With a dozen more protein structures determined after this discovery, a solution to the decade-old protein folding problem seemed within reach.
Myoglobin, which helps bring oxygen to our muscles, turned out to be a twisted, filamentous chain of 154 amino acids. Revolutionary as the result was, Kendrew had not opened the floodgates of protein architecture.
Sixty-five years after that Nobel Prize-winning breakthrough, scientists have used AI to kick-start the process. This year, London-based DeepMind released predicted structures for about 220 million proteins from bacteria, plants, animals and humans, covering nearly every protein known to science.
DeepMind founder and CEO Demis Hassabis said, “Essentially you can think it encompasses the whole protein universe.” However, another technology conglomerate aims to fill in the dark matter of that same universe, and its effort may reach further than DeepMind’s predictions.
Dark matter of the universe
Proteins are complex molecules that are responsible for the fundamental processes of life.
One of the new frontiers in natural science is metagenomics, which uses gene sequencing to discover proteins in samples of microbes that live in the soil, deep in the ocean and even in our guts and on our skin. But this same group of proteins is the least understood on Earth.
Decoding the structures of metagenomic proteins could help researchers find proteins to cure diseases, produce cleaner energy, and even solve the long-standing mystery of human evolution.
To that end, a group of researchers at Meta used artificial intelligence (AI) to predict the structures of more than 600 million previously uncharacterized proteins from viruses, bacteria and other microbes.
Alexander Rives, research lead on Meta AI’s protein team, said: “These are the structures we know the least about. These are incredibly mysterious proteins. I think they have the potential for great insight into biology.”
The team has created the first database that reveals the structures of the metagenomic world at the scale of hundreds of millions of proteins. The predictions were made using a ‘large language model’, the kind of AI that underlies tools for predicting text from just a few words.
Language models are normally trained on large volumes of text. Rives and his team instead fed their model the sequences of known proteins, each of which can be written as a chain of 20 different amino acids, with each amino acid represented by a letter.
‘Autocomplete’ of proteins
Thanks to advances in gene sequencing, it is now possible to catalog billions of metagenomic protein sequences. Researchers may have discovered these sequences, but understanding the biology behind them is a huge challenge.
Determining the three-dimensional structures of millions of proteins experimentally is out of reach for time-consuming techniques such as X-ray crystallography, which can take weeks or even years for a single protein. Computational techniques can provide insight into metagenomic proteins that is not possible with experimental approaches.
Meta has released the ESM Metagenomic Atlas of more than 600 million proteins, with predictions for the entire MGnify90 database, a public resource that catalogs metagenomic sequences. Meta claims, “To our knowledge, this is the largest database of high-resolution predicted structures, 3x larger than any existing protein structure database, and the first to fully encompass metagenomic proteins at scale.”
The atlas allows researchers to search for and analyze the structures of metagenomic proteins at the scale of hundreds of millions of proteins. This can help them search for distant evolutionary relationships, identify protein structures that have not been previously characterized, and discover new structures that could be useful in medicine and other applications.
The network was trained to ‘autocomplete’ proteins with some of the amino acids obscured.
Rives said, “This training imbued the network with an intuitive understanding of protein sequences, which contain information about their shapes.”
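The ‘autocomplete’ training setup described above can be illustrated with a minimal sketch. It only shows how a training example might be prepared: a sequence written in the 20-letter amino acid alphabet has some positions hidden, and the hidden letters become the answers a model is asked to recover. It is not Meta’s code. The 15% masking rate, the `#` placeholder, the function names and the example sequence are all assumptions made for illustration.

```python
import random

# The 20 standard amino acids, one letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # placeholder for a hidden residue (an assumption of this sketch)


def mask_sequence(sequence, mask_rate=0.15, seed=None):
    """Hide a random subset of residues and return (masked_sequence, targets).

    targets maps each hidden position to the residue a model would be
    asked to recover. The 15% default rate is an assumption.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    unknown = set(sequence) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    rng = random.Random(seed)
    # Hide at least one position so every example carries a training signal.
    n_masked = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = sequence[pos]
        masked[pos] = MASK
    return "".join(masked), targets


def unmask(masked, targets):
    """Fill the hidden positions back in, as a perfect model would."""
    residues = list(masked)
    for pos, residue in targets.items():
        residues[pos] = residue
    return "".join(residues)


# Arbitrary illustrative sequence, not a real protein from the article.
example = "MKTAYIAKQRQISFVKSHFSRQ"
masked, targets = mask_sequence(example, seed=0)
print(masked)   # the sequence with three positions replaced by '#'
print(targets)  # position -> hidden amino acid
```

In real training, a neural network would see only the masked string and be scored on how well it guesses the hidden letters. Doing that across many sequences is what pushes the network to pick up the patterns Rives describes.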
Faster, but not as accurate as AlphaFold
The second step in training the network was inspired by none other than DeepMind’s protein-structure AI, ‘AlphaFold’: it combines these insights with information about the relationships between known protein structures and sequences to generate predicted structures.
The team then decided to apply the model to a database of bulk-sequenced ‘metagenomic’ DNA from sources such as seawater, soil, skin and the human gut, among other microbial habitats. The majority of these DNA entries are derived from organisms unknown to science.
In total, Meta predicted the structures of 617 million proteins. The effort took the team just two weeks; AlphaFold, by comparison, can take minutes to generate a single prediction.
Meta’s network, ‘ESMFold’, is not as accurate as AlphaFold. But Rives’ team reported earlier this year that the model is about 60 times faster at predicting protein structures. “This means we can scale structure prediction to much larger databases,” he said.
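To put those figures in perspective, here is a rough back-of-the-envelope calculation of the combined rate implied by 617 million predictions in two weeks. The article does not say how many machines ran in parallel, so this is an aggregate rate for the whole effort, not the speed of a single ESMFold run. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def predictions_per_second(n_predictions, days):
    """Average aggregate throughput over the whole run."""
    return n_predictions / (days * SECONDS_PER_DAY)


# Figures from the article: 617 million structures in two weeks.
rate = predictions_per_second(617_000_000, 14)
print(f"about {rate:.0f} predictions per second overall")  # about 510
```

At that pace the team was finishing, on average, a new structure prediction roughly every two milliseconds across all of its hardware.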
Sergey Ovchinnikov, a renowned biologist at Harvard University in Cambridge, Massachusetts, examined the millions of predictions that ESMFold made with low confidence.
He said some of these proteins may lack a well-defined structure, while others might be non-coding DNA mistaken for protein-coding material. He says, “It looks like there’s still more than half of the protein space that we don’t know about.”
Sources say DeepMind currently has no plans to include metagenomic structure predictions in its database. But Meta’s new prediction model lets researchers study how language models can be used across disciplines, opening the door to new breakthroughs in understanding metagenomic protein structures.
However, many of the predictions have yet to be confirmed.