Latest Articles
Semantic design of functional de novo genes from a genomic language model
Generative genomic models can design increasingly complex biological systems [1]. However, controlling these models to generate novel sequences with desired functions remains challenging. Here, we show that Evo, a genomic language model, can leverage genomic context to perform function-guided design that accesses novel regions of sequence space. By learning semantic relationships across prokaryotic genes [2], Evo enables a genomic ‘autocomplete’ in which a DNA prompt encoding genomic context for a function of interest guides the generation of novel sequences enriched for related functions, which we refer to as ‘semantic design’. We validate this approach by experimentally testing the activity of generated anti-CRISPR proteins and type II and III toxin–antitoxin systems, including de novo genes with no significant sequence similarity to natural proteins. In-context design of proteins and non-coding RNAs with Evo achieves robust activity and high experimental success rates even in the absence of structural priors, known evolutionary conservation or task-specific fine-tuning. We then use Evo to complete millions of prompts to produce SynGenome, a database containing over 120 billion base pairs of artificial intelligence-generated genomic sequences that enables semantic design across many functions. More broadly, these results demonstrate that generative genomics with biological language models can extend beyond natural sequences.
Enzyme specificity prediction using cross attention graph neural networks
Enzymes are the molecular machines of life, and a key property that governs their function is substrate specificity: the ability of an enzyme to recognize and selectively act on particular substrates. This specificity originates from the three-dimensional (3D) structure of the enzyme active site and the complicated transition state of the reaction [1,2]. Many enzymes can promiscuously catalyze reactions or act on substrates beyond those for which they originally evolved [1,3-5]. However, millions of known enzymes still lack reliable substrate specificity information, impeding their practical application and a comprehensive understanding of the biocatalytic diversity in nature. Herein, we developed a cross-attention-empowered SE(3)-equivariant graph neural network architecture named EZSpecificity for predicting enzyme substrate specificity, which was trained on a comprehensive tailor-made database of enzyme-substrate interactions at sequence and structural levels. EZSpecificity outperformed the existing machine learning models for enzyme substrate specificity prediction, as demonstrated by both an unknown substrate and enzyme database and seven proof-of-concept protein families. Experimental validation with eight halogenases and 78 substrates revealed that EZSpecificity achieved 91.7% accuracy in identifying the single potential reactive substrate, significantly higher than that of the state-of-the-art model ESP (58.3%). EZSpecificity represents a general machine learning model for accurate prediction of substrate specificity for enzymes related to fundamental and applied research in biology and medicine.
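For readers unfamiliar with the term, the cross-attention named in the abstract above can be illustrated with a minimal, generic sketch. This is not the EZSpecificity implementation: the function names, the toy embeddings and the pairing of enzyme residues (queries) with substrate atoms (keys and values) are assumptions made for illustration, and the sketch omits learned projections, the graph neural network and SE(3) equivariance entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Plain scaled dot-product cross-attention (illustrative only).

    queries: one vector per enzyme residue (hypothetical embeddings)
    keys, values: one vector per substrate atom (hypothetical embeddings)
    Returns (outputs, weights), with one row per query.
    """
    d = len(keys[0])  # key dimension used for scaling
    outputs, weights = [], []
    for q in queries:
        # Similarity of this residue to every substrate atom.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Weighted mix of substrate-atom values for this residue.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data: three "residues" attending over two "substrate atoms".
residues = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
atoms = [[1.0, 0.0], [0.0, 1.0]]
atom_values = [[10.0, 0.0], [0.0, 10.0]]

outputs, weights = cross_attention(residues, atoms, atom_values)
```

In this toy setup each residue's attention weights sum to one, and a residue whose embedding aligns with a given atom places more weight on that atom. A real model would learn the query, key and value projections from data.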