Alibaba Cloud’s AI Technology Sparks Breakthrough in RNA Virus Discovery

  • The tool led to the discovery of more than 160,000 potential RNA virus species and 180 RNA virus supergroups
  • Deep learning algorithms can be applied to fields like microbial genomics and epidemiology

How we identify new viruses is about to get faster and more precise, thanks to groundbreaking research led by Alibaba Cloud in collaboration with China’s Sun Yat-sen University and The University of Sydney in Australia.

In a study published on Oct. 9, 2024, in the peer-reviewed scientific journal Cell, the three institutions unveiled a deep-learning algorithm for detecting RNA viruses.

LucaProt, as the model is known, improves the viral discovery process by analyzing samples for protein sequences and structural features, helping researchers discover thousands of potential viruses.

“The success of this model not only paves the way for further applications of artificial intelligence in microbiology but also sets a new standard for data-driven discovery in biological sciences,” said Yong He, algorithm expert at Alibaba Cloud’s Apsara Lab and co-author of the recent research paper.

The study is the largest virus species discovery ever published in terms of the number of species identified, and has given rise to a powerful healthcare tool.

On test data sets, LucaProt showed high specificity (0.014% false positives) and high sensitivity (1.72% false negatives). It has led to the discovery of more than 160,000 potential RNA virus species and 180 RNA virus supergroups.
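
For readers who want the arithmetic, the error rates quoted above can be converted into specificity and sensitivity figures. The sketch below assumes the standard definitions (false-positive rate measured against true non-viral sequences, false-negative rate measured against true viral sequences); the article does not state the denominators, and the function name is illustrative only.

```python
def rates_to_metrics(false_positive_pct: float, false_negative_pct: float) -> dict:
    """Convert error rates (in percent) into specificity and sensitivity (in percent).

    Assumes: specificity = 100 - false-positive rate,
             sensitivity = 100 - false-negative rate.
    """
    return {
        "specificity_pct": round(100.0 - false_positive_pct, 3),
        "sensitivity_pct": round(100.0 - false_negative_pct, 3),
    }


# Error rates as reported for LucaProt on its test data sets
metrics = rates_to_metrics(0.014, 1.72)
print(metrics)
```

Under those assumptions, the reported rates correspond to a specificity of about 99.986% and a sensitivity of about 98.28%.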

“This research is part of Alibaba Cloud’s ongoing efforts to collaborate with academic and research institutions to harness the power of cloud computing and AI in driving the innovation and discovery in science research,” Yong added.

A Major Milestone

RNA viruses cause a range of diseases, which makes understanding their evolution vital for public health.

But traditional methods of virus identification that rely on sequence similarity can overlook unknown viruses, putting an entire realm of viruses out of reach – until now.

During the study, researchers applied LucaProt to a data set of 10,487 metatranscriptomes from diverse ecological systems and watched as the algorithm identified 161,979 potential new virus species and 180 new RNA virus supergroups.

The newly discovered viruses were present in a range of environments, from air to hot springs to hydrothermal vents, and virus diversity and abundance varied substantially among them.

“We have been offered a window into an otherwise hidden part of life on earth, revealing remarkable biodiversity,” said Edward Holmes, a senior author on the paper and a professor at the School of Medical Sciences in the Faculty of Medicine and Health at the University of Sydney.

Applications and Implications

Leveraging AI, researchers are able to go beyond the constraints of traditional methods to embrace a more dynamic, comprehensive, and precise approach to virus discovery.

“We used to rely on tedious bioinformatics pipelines for virus discovery, which limited the diversity we could explore. Now, we have a much more effective AI-based model that offers exceptional sensitivity and specificity, and at the same time allows us to delve much deeper into viral diversity,” paper co-author Mang Shi said.

With new viruses being discovered in some of the most extreme environments on the planet, it is clear that RNA viruses are not only more diverse than previously thought but also more resilient and adaptable.

“To find this many new viruses in one fell swoop is mind-blowing, and it just scratches the surface, opening up a world of discovery. There are millions more to be discovered, and we can apply this same approach to identifying bacteria and parasites,” Holmes said.

To aid this mission, Alibaba Cloud has unveiled several AI-driven discovery tools, which can identify patterns and features in complex data in fields like microbial genomics and epidemiology.

The specialized models include a unified base model for nucleic acids and proteins named LucaOne, and a protein language model named LucaPcycle.
