Latest Articles

45 articles
Active filters: Machine learning




Nature · Nov 19, 2025

Semantic design of functional de novo genes from a genomic language model

Generative genomic models can design increasingly complex biological systems¹. However, controlling these models to generate novel sequences with desired functions remains challenging. Here, we show that Evo, a genomic language model, can leverage genomic context to perform function-guided design that accesses novel regions of sequence space. By learning semantic relationships across prokaryotic genes², Evo enables a genomic ‘autocomplete’ in which a DNA prompt encoding genomic context for a function of interest guides the generation of novel sequences enriched for related functions, which we refer to as ‘semantic design’. We validate this approach by experimentally testing the activity of generated anti-CRISPR proteins and type II and III toxin–antitoxin systems, including de novo genes with no significant sequence similarity to natural proteins. In-context design of proteins and non-coding RNAs with Evo achieves robust activity and high experimental success rates even in the absence of structural priors, known evolutionary conservation or task-specific fine-tuning. We then use Evo to complete millions of prompts to produce SynGenome, a database containing over 120 billion base pairs of artificial intelligence-generated genomic sequences that enables semantic design across many functions. More broadly, these results demonstrate that generative genomics with biological language models can extend beyond natural sequences.

Computational models Genetic databases Machine learning Protein design






Nature Biomedical Engineering · Nov 05, 2025

A pre-trained large generative model for translating single-cell transcriptomes to proteomes

Measuring protein abundance at the single-cell level can facilitate a high-resolution understanding of biological mechanisms in cellular processes and disease progression. However, current single-cell proteomic technologies face challenges such as limited coverage, constrained throughput and sensitivity, batch effects, high costs and stringent experimental operations. Inspired by the translation procedure in both natural language processing and the genetic central dogma, we propose a pre-trained, large generative model named single-cell translator (scTranslator). scTranslator can generate multi-omics data by inferring the missing single-cell proteome based on the transcriptome. Through systematic benchmarking and validation on independent datasets, we have confirmed the accuracy, stability and flexibility of scTranslator across various profiling techniques (for example, CITE-seq, spatial CITE-seq, REAP-seq, NEAT-seq), cell types (for example, monocytes, macrophages, T cells, B cells), tissues (for example, blood, lung, brain) and a wide range of disease contexts, including infectious, metabolic and oncologic conditions. Furthermore, scTranslator shows its superiority in assisting various downstream analyses and applications, including gene/protein interaction inference, perturbation prediction, cell clustering, batch correction and cell origin recognition in pan-cancer data.

Machine learning Proteome informatics



Nature Methods · Oct 29, 2025

Annotating the genome at single-nucleotide resolution with DNA foundation models

Genome annotation models that directly analyze DNA sequences are indispensable for modern biological research, enabling rapid and accurate identification of genes and other functional elements. Current annotation tools are typically developed for specific element classes and trained from scratch using supervised learning on datasets that are often limited in size. Here we frame the genome annotation problem as multilabel semantic segmentation and introduce a methodology for fine-tuning pretrained DNA foundation models to segment 14 different genic and regulatory elements at single-nucleotide resolution. We leverage the self-supervised pretrained model Nucleotide Transformer to develop a general segmentation model, SegmentNT, capable of processing DNA sequences up to 50-kb long and that achieves state-of-the-art performance on gene annotation, splice site and regulatory elements detection. We also integrated in our framework the foundation models Enformer and Borzoi, extending the sequence context up to 500 kb and enhancing performance on regulatory elements. Finally, we show that a SegmentNT model trained on human genomic elements generalizes to different species, and a multispecies SegmentNT model achieves strong generalization across unseen species. Our approach is readily extensible to additional models, genomic elements and species.

Genomics Machine learning Software

Nature Medicine · Oct 27, 2025

A full life cycle biological clock based on routine clinical data and its impact in health and diseases

Aging research has primarily focused on adult aging clocks, leaving a critical gap in understanding a biological clock across the full life cycle, particularly during infancy and childhood. Here we introduce LifeClock, a biological clock model that predicts biological age across all life stages using routine electronic health records and laboratory test data. To enhance individualized predictions, we integrated virtual patient representations from 24,633,025 heterogeneous longitudinal clinical visits across 9,680,764 individuals and projected them into a latent space. Our approach leverages EHRFormer, a time-series transformer-based model, to analyze developmental and aging dynamics with high precision and develop accurate biological age clocks spanning infancy to old age. Our findings reveal distinct biological clock patterns across different life stages. The pediatric clock is strongly associated with children’s development and accurately predicts current and future risks of major pediatric diseases, including malnutrition, growth and developmental abnormalities. The adult clock is strongly associated with aging and accurately predicts current and future risks of major age-related diseases, such as diabetes, renal failure, stroke and cardiovascular diseases. This work therefore distinguishes pediatric development from adult aging, establishing a novel framework to advance precision health by leveraging routine clinical data across the entire lifespan.

Ageing Data mining Machine learning



Nature Biomedical Engineering · Oct 24, 2025

Implanted microelectrode arrays in reinnervated muscles allow separation of neural drives from transferred polyfunctional nerves

Targeted muscle reinnervation surgery reroutes residual nerve signals into spare muscles, enabling the recovery of neural information through electromyography (EMG). However, EMG signals are often overlapping, making the interpretation of limb functions complicated. Regenerative peripheral nerve interfaces surgically partition the nerve into individual fascicles that reinnervate specific muscle grafts, isolating distinct neural sources for precise control and interpretation of EMG signals. Here we combine targeted muscle reinnervation surgery of polyvalent nerves with a high-density microelectrode array implanted at a single site within a reinnervated muscle, and via mathematical source separation methods, we separate all neural signals that are redirected into a single muscle. In participants with upper-limb amputation, the deconvolution of EMG signals from four reinnervated muscles into motor unit spike trains revealed distinct clusters of motor neurons associated with diverse functional tasks. Our method enabled the extraction of multiple neural commands within a single reinnervated muscle, eliminating the need for surgical nerve division. This approach holds promise for enhancing control over prosthetic limbs and for understanding how the central nervous system encodes movement after reinnervation.

Biomedical engineering Computational neuroscience Machine learning Microarrays Motor neuron





Nature Methods · Oct 08, 2025

Automated classification of cellular expression in multiplexed imaging data with Nimbus

Multiplexed imaging offers a powerful approach to characterize the spatial topography of tissues in both health and disease. To analyze such data, the specific combination of markers that are present in each cell must be enumerated to enable accurate phenotyping, a process that often relies on unsupervised clustering. We constructed the Pan-Multiplex (Pan-M) dataset containing 197 million distinct annotations of marker expression across 15 different cell types. We used Pan-M to create Nimbus, a deep learning model to predict marker positivity from multiplexed image data. Nimbus is a pretrained model that uses the underlying images to classify marker expression of individual cells as positive or negative across distinct cell types, from different tissues, acquired using different microscope platforms, without requiring any retraining. We demonstrate that Nimbus predictions capture the underlying staining patterns of the full diversity of markers present in Pan-M, and that Nimbus matches or exceeds the accuracy of previous approaches that must be retrained on each dataset. We then show how Nimbus predictions can be integrated with downstream clustering algorithms to robustly identify cell subtypes in image data. We have open-sourced Nimbus and Pan-M to enable community use at https://github.com/angelolab/Nimbus-Inference.

Image processing Machine learning Software

Nature · Oct 08, 2025

Enzyme specificity prediction using cross attention graph neural networks

Enzymes are the molecular machines of life, and a key property that governs their function is substrate specificity—the ability of an enzyme to recognize and selectively act on particular substrates. This specificity originates from the three-dimensional (3D) structure of the enzyme active site and complicated transition state of the reaction¹,². Many enzymes can promiscuously catalyze reactions or act on substrates beyond those for which they were originally evolved¹,³⁻⁵. However, millions of known enzymes still lack reliable substrate specificity information, impeding their practical applications and comprehensive understanding of the biocatalytic diversity in nature. Herein, we developed a cross-attention-empowered SE(3)-equivariant graph neural network architecture named EZSpecificity for predicting enzyme substrate specificity, which was trained on a comprehensive tailor-made database of enzyme-substrate interactions at sequence and structural levels. EZSpecificity outperformed the existing machine learning models for enzyme substrate specificity prediction, as demonstrated by both an unknown substrate and enzyme database and seven proof-of-concept protein families. Experimental validation with eight halogenases and 78 substrates revealed that EZSpecificity achieved a 91.7% accuracy in identifying the single potential reactive substrate, significantly higher than that of the state-of-the-art model ESP (58.3%). EZSpecificity represents a general machine learning model for accurate prediction of substrate specificity for enzymes related to fundamental and applied research in biology and medicine.

Biocatalysis Machine learning Protein function predictions

Nature Methods · Oct 03, 2025

All-at-once RNA folding with 3D motif prediction framed by evolutionary information

Structural RNAs exhibit a vast array of recurrent short three-dimensional (3D) elements found in loop regions involving non-Watson–Crick interactions that help arrange canonical double helices into tertiary structures. Here we present CaCoFold-R3D, a probabilistic grammar that predicts these RNA 3D motifs (also termed modules) jointly with RNA secondary structure over a sequence or alignment. CaCoFold-R3D uses evolutionary information present in an RNA alignment to reliably identify canonical helices (including pseudoknots) by covariation. Here we further introduce the R3D grammars, which also exploit helix covariation that constrains the positioning of the mostly noncovarying RNA 3D motifs. Our method runs predictions over an almost-exhaustive list of over 50 known RNA motifs (‘everything’). Motifs can appear in any nonhelical loop region (including three-way, four-way and higher junctions) (‘everywhere’). All structural motifs as well as the canonical helices are arranged into one single structure predicted by one single joint probabilistic grammar (‘all-at-once’). Our results demonstrate that CaCoFold-R3D is a valid alternative for predicting the all-residue interactions present in an RNA 3D structure. CaCoFold-R3D is fast and easily customizable for novel motif discovery and shows promising value both as a strong input for deep learning approaches to all-atom structure prediction and toward guiding RNA design as drug targets for therapeutic small molecules.

Computational models Machine learning Non-coding RNAs Riboswitches




Nature Biomedical Engineering · Sep 30, 2025

Brain–heart–eye axis revealed by multi-organ imaging genetics and proteomics

Multi-organ research investigates interconnections among multiple human organ systems, enhancing our understanding of human aging and disease mechanisms. Here we use multi-organ imaging, individual- and summary-level genetics, and proteomics data consolidated via the MULTI Consortium to delineate a brain–heart–eye axis using brain patterns of structural covariance (PSCs), heart imaging-derived phenotypes (IDPs) and eye IDPs. We find that proteome-wide associations of the PSCs and IDPs show within-organ specificity and cross-organ interconnections. Pleiotropic effects of common single-nucleotide polymorphisms are observed across multiple organs, and key genetic parameters are estimated for single-nucleotide polymorphism-based heritability, polygenicity and selection signatures across the three organs. A gene–drug–disease network shows the potential of drug repurposing for cross-organ diseases. Co-localization and causal analyses reveal cross-organ causal relationships between PSC/IDP and chronic diseases, such as Alzheimer’s disease, heart failure and glaucoma. Finally, integrating multi-organ/omics features improves prediction for systemic disease categories and cognition compared with single-organ/omics features, providing future avenues for modelling human aging and disease.

Genetics research Heritable quantitative trait Machine learning











Nature Biomedical Engineering · Sep 05, 2025

A generalist foundation model and database for open-world medical image segmentation

Vision foundation models have demonstrated vast potential in achieving generalist medical segmentation capability, providing a versatile, task-agnostic solution through a single model. However, current generalist models involve simple pre-training on various medical data containing irrelevant information, often resulting in the negative transfer phenomenon and degenerated performance. Furthermore, the practical applicability of foundation models across diverse open-world scenarios, especially in out-of-distribution (OOD) settings, has not been extensively evaluated. Here we construct a publicly accessible database, MedSegDB, based on a tree-structured hierarchy and annotated from 129 public medical segmentation repositories and 5 in-house datasets. We further propose a Generalist Medical Segmentation model (MedSegX), a vision foundation model trained with a model-agnostic Contextual Mixture of Adapter Experts (ConMoAE) for open-world segmentation. We conduct a comprehensive evaluation of MedSegX across a range of medical segmentation tasks. Experimental results indicate that MedSegX achieves state-of-the-art performance across various modalities and organ systems in in-distribution (ID) settings. In OOD and real-world clinical settings, MedSegX consistently maintains its performance in both zero-shot and data-efficient generalization, outperforming other foundation models.

Imaging Machine learning





Nature Medicine · Aug 20, 2025

AI-based diagnosis of acute aortic syndrome from noncontrast CT

The accurate and timely diagnosis of acute aortic syndrome (AAS) in patients presenting with acute chest pain remains a clinical challenge. Aortic computed tomography (CT) angiography is the imaging protocol of choice in patients with suspected AAS. However, due to economic and workflow constraints in China, the majority of suspected patients undergo noncontrast CT as the initial imaging test, and CT angiography is reserved for those at higher risk. Although noncontrast CT can reveal specific signs indicative of AAS, its diagnostic efficacy when used alone has not been well characterized. Here we present an artificial intelligence-based warning system, iAorta, using noncontrast CT for AAS identification in China, which demonstrates remarkably high accuracy and provides clinicians with interpretable warnings. iAorta was evaluated through a comprehensive step-wise study. In the multicenter retrospective study (n = 20,750), iAorta achieved a mean area under the receiver operating characteristic curve of 0.958 (95% confidence interval 0.950–0.967). In the large-scale real-world study (n = 137,525), iAorta demonstrated consistently high performance across various noncontrast CT protocols, achieving a sensitivity of 0.913–0.942 and a specificity of 0.991–0.993. In the prospective comparative study (n = 13,846), iAorta demonstrated the capability to significantly shorten the time to correct diagnostic pathway for patients with initial false suspicion from an average of 219.7 (115–325) min to 61.6 (43–89) min. Furthermore, for the prospective pilot deployment that we conducted, iAorta correctly identified 21 out of 22 patients with AAS among 15,584 consecutive patients presenting with acute chest pain and under noncontrast CT protocol in the emergency department. For these 21 AAS-positive patients, the average time to diagnosis was 102.1 (75–133) min. Finally, iAorta may help prevent delayed or missed diagnoses of AAS in settings where noncontrast CT remains the only feasible initial imaging modality—such as in resource-limited regions or in patients who cannot receive, or did not receive, intravenous contrast.

Aortic diseases Computed tomography Machine learning