N Nature Methods · Nov 24, 2025 TIRTL-seq: deep, quantitative and affordable paired TCR repertoire sequencing The specificity of T cells is determined by T cell receptor (TCR) α and β chain sequences. While bulk TCR sequencing enables cost-effective repertoire profiling without chain pairing information, single-cell approaches provide paired data but are costly and limited in throughput. Here we present throughput-intensive rapid TCR library sequencing (TIRTL-seq), an experimental and computational methodology for paired TCR repertoire sequencing (TCR-seq). TIRTL-seq is based on the parallel generation of hundreds of TCR libraries in 384-well plates at less than US$200 per plate, allowing cohort-scale paired TCR-seq studies. We benchmarked TIRTL-seq against state-of-the-art bulk TCR-seq and 10x Genomics Chromium technologies on longitudinal samples and identified severe acute respiratory syndrome coronavirus 2- and Epstein–Barr virus-specific clonal expansions after infection with distinct dynamics. TIRTL-seq offers a universal protocol scalable from a single cell to millions of T cells per sample, simultaneously delivering both precise clonal frequency estimation and accurate TCR chain pairing, combining the strengths of bulk and single-cell TCR-seq. TIRTL-seq is a high-throughput method for paired T cell receptor sequencing at the cohort scale. Adaptive immunity Immunological techniques Sequencing Software Systems biology
N Nature Genetics · Nov 20, 2025 Scalable and accurate rare variant meta-analysis with Meta-SAIGE Meta-analysis enhances the power of rare variant association tests by combining summary statistics across several cohorts. However, existing methods often fail to control type I error for low-prevalence binary traits and are computationally intensive. Here we introduce Meta-SAIGE—a scalable method for rare variant meta-analysis that accurately estimates the null distribution to control type I error and reuses the linkage disequilibrium matrix across phenotypes to boost computational efficiency in phenome-wide analyses. Simulations using UK Biobank whole-exome sequencing data show that Meta-SAIGE effectively controls type I error and achieves power comparable to pooled individual-level analysis with SAIGE-GENE+. Applying Meta-SAIGE to 83 low-prevalence phenotypes in UK Biobank and All of Us whole-exome sequencing data identified 237 gene–trait associations. Notably, 80 of these associations were not significant in either dataset alone, underscoring the power of our meta-analysis. Bioinformatics Genetics research Genome-wide association studies Software
N Nature Methods · Nov 18, 2025 ImmunoMatch learns and predicts cognate pairing of heavy and light immunoglobulin chains The development of stable antibodies formed by compatible heavy (H) and light (L) chain pairs is crucial in both in vivo maturation of antibody-producing cells and ex vivo designs of therapeutic antibodies. We present ImmunoMatch, a machine-learning framework trained on paired H and L sequences from human B cells to identify molecular features underlying chain compatibility. ImmunoMatch distinguishes cognate from random H–L pairs and captures differences associated with κ and λ light chains, reflecting B cell selection mechanisms in the bone marrow. We apply ImmunoMatch to reconstruct paired antibodies from spatial VDJ sequencing data and study the refinement of H–L pairing across B cell maturation stages in health and disease. We find further that ImmunoMatch is sensitive to sequence differences at the H–L interface. These insights provide a computational lens into the broader biological principles governing antibody assembly and stability. Adaptive immunity Lymphocytes Machine learning Software
N Nature Methods · Nov 13, 2025 Bin Chicken: targeted metagenomic coassembly for the efficient recovery of novel genomes The recovery of microbial genomes from metagenomic datasets has provided genomic representation for hundreds of thousands of species from diverse biomes. However, low-abundance microorganisms are often missed due to insufficient genomic coverage. Here we present Bin Chicken, an algorithm that substantially improves genome recovery through automated, targeted selection of metagenomes for coassembly based on shared marker gene sequences derived from raw reads. Marker gene sequences that are divergent from known reference genomes can be further prioritized, providing an efficient means of recovering highly novel genomes. Applying Bin Chicken to public metagenomes and coassembling 800 sample groups recovered 77,562 microbial genomes, including the first genomic representatives of 6 phyla, 41 classes and 24,028 species. These genomes expand the genomic tree of life and uncover a wealth of novel microbial lineages for further research. Data mining Genome informatics Metagenomics Microbial genetics Software
N Nature Genetics · Nov 12, 2025 Computationally efficient meta-analysis of gene-based tests using summary statistics in large-scale genetic studies Meta-analysis of gene-based tests using single-variant summary statistics is a powerful strategy for genetic association studies. However, current approaches require sharing the covariance matrix between variants for each study and trait of interest. For large-scale studies with many phenotypes, these matrices can be cumbersome to calculate, store and share. Here, to address this challenge, we present REMETA—an efficient tool for meta-analysis of gene-based tests. REMETA uses a single sparse covariance reference file per study that is rescaled for each phenotype using single-variant summary statistics. We develop new methods for binary traits with case–control imbalance, and to estimate allele frequencies, genotype counts and effect sizes of burden tests. We demonstrate the performance and advantages of our approach through meta-analysis of five traits in 469,376 samples in UK Biobank. The open-source REMETA software will facilitate meta-analysis across large-scale exome sequencing studies from diverse studies that cannot easily be combined. Genome-wide association studies Software
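The rescaling idea in the REMETA abstract can be illustrated with a small sketch. This is not REMETA's implementation: it assumes the shared reference is a variant correlation (LD) matrix, that each phenotype supplies per-variant score statistics and their variances, and that studies are combined by summing scores and covariances. The function names (`rescale_covariance`, `burden_test`, `meta_burden`) and all numbers are hypothetical.

```python
from math import erfc, sqrt


def rescale_covariance(ld_corr, score_variances):
    """Turn a reference correlation matrix into a phenotype-specific
    covariance matrix: C[i][j] = sd[i] * R[i][j] * sd[j] (assumed form)."""
    sd = [sqrt(v) for v in score_variances]
    n = len(sd)
    return [[sd[i] * ld_corr[i][j] * sd[j] for j in range(n)] for i in range(n)]


def burden_test(scores, cov, weights):
    """One-degree-of-freedom burden test from score statistics.
    Returns (chi-square statistic, P value)."""
    n = len(scores)
    numerator = sum(w * s for w, s in zip(weights, scores))
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    stat = numerator * numerator / variance
    # Survival function of a chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2.0))


def meta_burden(studies, weights):
    """Combine studies by summing scores and rescaled covariances.
    Each study is a tuple (scores, ld_corr, score_variances)."""
    n = len(weights)
    total_scores = [0.0] * n
    total_cov = [[0.0] * n for _ in range(n)]
    for scores, ld_corr, score_variances in studies:
        cov = rescale_covariance(ld_corr, score_variances)
        for i in range(n):
            total_scores[i] += scores[i]
            for j in range(n):
                total_cov[i][j] += cov[i][j]
    return burden_test(total_scores, total_cov, weights)


# Toy example: two variants, one LD reference reused by two identical studies.
ld = [[1.0, 0.5], [0.5, 1.0]]
study = ([2.0, 3.0], ld, [4.0, 9.0])
stat, pval = meta_burden([study, study], weights=[1.0, 1.0])
print(f"burden chi-square = {stat:.3f}, P = {pval:.3f}")
```

The point of the design is that only the per-variant summary statistics change between phenotypes, so the correlation matrix is stored once per study.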
N Nature Biotechnology · Nov 11, 2025 Multimodal learning enables chat-based exploration of single-cell data Single-cell sequencing characterizes biological samples at unprecedented scale and detail, but data interpretation remains challenging. Here, we present CellWhisperer, an artificial intelligence (AI) model and software tool for chat-based interrogation of gene expression. We establish a multimodal embedding of transcriptomes and their textual annotations, using contrastive learning on 1 million RNA sequencing profiles with AI-curated descriptions. This embedding informs a large language model that answers user-provided questions about cells and genes in natural-language chats. We benchmark CellWhisperer’s performance for zero-shot prediction of cell types and other biological annotations and demonstrate its use for biological discovery in a meta-analysis of human embryonic development. We integrate a CellWhisperer chat box with the CELLxGENE browser, allowing users to interactively explore gene expression through a combined graphical and chat interface. In summary, CellWhisperer leverages large community-scale data repositories to connect transcriptomes and text, thereby enabling interactive exploration of single-cell RNA-sequencing data with natural-language chats. Gene regulation in immune cells Machine learning Preclinical research Software Transcriptomics
N Nature Methods · Nov 11, 2025 Universal consensus 3D segmentation of cells from 2D segmented stacks Cell segmentation is the foundation of a wide range of microscopy-based biological studies. Deep learning has revolutionized two-dimensional (2D) cell segmentation, enabling generalized solutions across cell types and imaging modalities. This has been driven by the ease of scaling up image acquisition, annotation and computation. However, three-dimensional (3D) cell segmentation, requiring dense annotation of 2D slices, still poses substantial challenges. Manual labeling of 3D cells to train broadly applicable segmentation models is prohibitive. Even in high-contrast images annotation is ambiguous and time-consuming. Here we develop a theory and toolbox, u-Segment3D, for 2D-to-3D segmentation, compatible with any 2D method generating pixel-based instance cell masks. u-Segment3D translates and enhances 2D instance segmentations to a 3D consensus instance segmentation without training data, as demonstrated on 11 real-life datasets, comprising >70,000 cells, spanning single cells, cell aggregates and tissue. Moreover, u-Segment3D is competitive with native 3D segmentation, even exceeding when cells are crowded and have complex morphologies. Cellular imaging Image processing Machine learning Software
N Nature Methods · Nov 07, 2025 Monod: model-based discovery and integration through fitting stochastic transcriptional dynamics to single-cell sequencing data Single-cell RNA sequencing analysis centers on illuminating cell diversity and understanding the transcriptional mechanisms underlying cellular function. These datasets are large, noisy and complex. Current analyses prioritize noise removal and dimensionality reduction to tackle these challenges and extract biological insight. We propose an alternative, physical approach to leverage the stochasticity, size and multimodal nature of these data to explicitly distinguish their biological and technical facets while revealing the underlying regulatory processes. With the Python package Monod, we demonstrate how nascent and mature RNA counts, present in most published datasets, can be meaningfully ‘integrated’ under biophysical models of transcription. By using variation in these modalities, we can identify transcriptional modulation not discernible through changes in average gene expression, quantitatively compare mechanistic hypotheses of gene regulation, analyze transcriptional data from different technologies within a common framework and minimize the use of opaque or distortive normalization and transformation techniques. Computational biophysics Computational models Software Transcriptomics
N Nature Aging · Nov 04, 2025 A unified framework for systematic curation and evaluation of aging biomarkers Aging biomarkers are essential tools for quantifying biological aging, but systematic validation has been hindered by methodological inconsistencies and fragmented datasets. Here we show that the ability of traditional aging clocks to predict chronological age does not correlate with mortality prediction capacity (R = 0.12, P = 0.67), suggesting that these metrics capture distinct biological processes. We developed Biolearn, an open-source framework enabling standardized evaluation of 39 biomarkers across over 20,000 individuals from diverse cohorts. The Horvath skin and blood clock achieved the highest chronological age accuracy (R² = 0.88), while GrimAge2 demonstrated the strongest mortality association (hazard ratio = 2.57) and healthspan prediction (hazard ratio = 2.00). Our systematic evaluation reveals considerable heterogeneity in biomarker performance across different clinical outcomes, with optimal biomarkers varying according to specific application. Biolearn provides unified data processing pipelines with quality control and cell-type deconvolution capabilities, establishing a foundation for reproducible aging research and facilitating development of robust aging biomarkers. Computational models Software Systems biology
N Nature Biotechnology · Nov 04, 2025 KATMAP infers splicing factor activity and regulatory targets from knockdown data Typical RNA sequencing (RNA-seq) experiments uncover hundreds of splicing changes, reflecting underlying changes in splicing factor (SF) activity. Understanding how SF activity influences transcriptomic variation requires elucidating how each SF impacts splicing. Here, we present an interpretable regression model, KATMAP, which models splicing changes throughout the transcriptome by analyzing changes in SF binding and the resulting alterations in RNA processing. To learn a regulatory model, KATMAP requires SF perturbation RNA-seq data and the SF’s binding motif as inputs, returning a description of the SF’s position-specific regulatory activity and predicted targets. The KATMAP software includes models pretrained on ENCODE SF knockdown data. Learned KATMAP models can be applied to predict SF regulation and cis-elements at individual exons, which can guide the design of splice-switching antisense oligonucleotides. KATMAP can also interpret RNA-seq data by uncovering the factors responsible for transcriptomic changes, distinguishing direct SF targets from indirect effects and inferring relevant SFs from clinical RNA-seq data. Computational models Software Transcriptomics
N Nature Methods · Nov 03, 2025 STORIES: learning cell fate landscapes from spatial transcriptomics using optimal transport In dynamic biological processes such as development, spatial transcriptomics is revolutionizing the study of the mechanisms underlying spatial organization within tissues. Inferring cell fate trajectories from spatial transcriptomics profiled at several time points has thus emerged as a critical goal, requiring novel computational methods. Wasserstein gradient flow learning is a promising framework for analyzing sequencing data across time, built around a neural network representing the differentiation potential. However, existing gradient flow learning methods face challenges in analyzing spatially resolved transcriptomic data. Here, we propose STORIES, a method that uses an extension of Optimal Transport to learn a spatially informed potential. We benchmark our approach using three large Stereo-seq spatiotemporal atlases and demonstrate superior spatial coherence compared to existing approaches. Finally, we provide an in-depth analysis of axolotl neural regeneration and mouse gliogenesis, recovering gene trends for known markers such as Nptx1 in neuron regeneration and Aldh1l1 in gliogenesis and additional putative drivers. Computational models Differentiation Software Transcriptomics
N Nature Methods · Oct 30, 2025 Nicheformer: a foundation model for single-cell and spatial omics Tissue makeup depends on the local cellular microenvironment. Spatial single-cell genomics enables scalable and unbiased interrogation of these interactions. Here we introduce Nicheformer, a transformer-based foundation model trained on both human and mouse dissociated single-cell and targeted spatial transcriptomics data. Pretrained on SpatialCorpus-110M, a curated collection of over 57 million dissociated and 53 million spatially resolved cells across 73 tissues on cellular reconstruction, Nicheformer learns cell representations that capture spatial context. It excels in linear-probing and fine-tuning scenarios for a newly designed set of downstream tasks, in particular spatial composition prediction and spatial label prediction. Critically, we show that models trained only on dissociated data fail to recover the complexity of spatial microenvironments, underscoring the need for multiscale integration. Nicheformer enables the prediction of the spatial context of dissociated cells, allowing the transfer of rich spatial information to scRNA-seq datasets. Overall, Nicheformer sets the stage for the next generation of machine-learning models in spatial single-cell analysis. Computational models Machine learning Software Transcriptomics
N Nature Methods · Oct 29, 2025 Annotating the genome at single-nucleotide resolution with DNA foundation models Genome annotation models that directly analyze DNA sequences are indispensable for modern biological research, enabling rapid and accurate identification of genes and other functional elements. Current annotation tools are typically developed for specific element classes and trained from scratch using supervised learning on datasets that are often limited in size. Here we frame the genome annotation problem as multilabel semantic segmentation and introduce a methodology for fine-tuning pretrained DNA foundation models to segment 14 different genic and regulatory elements at single-nucleotide resolution. We leverage the self-supervised pretrained model Nucleotide Transformer to develop a general segmentation model, SegmentNT, capable of processing DNA sequences up to 50-kb long and that achieves state-of-the-art performance on gene annotation, splice site and regulatory elements detection. We also integrated in our framework the foundation models Enformer and Borzoi, extending the sequence context up to 500 kb and enhancing performance on regulatory elements. Finally, we show that a SegmentNT model trained on human genomic elements generalizes to different species, and a multispecies SegmentNT model achieves strong generalization across unseen species. Our approach is readily extensible to additional models, genomic elements and species. Genomics Machine learning Software
N Nature Methods · Oct 27, 2025 Improved reconstruction of single-cell developmental potential with CytoTRACE 2 While single-cell RNA sequencing has advanced our understanding of cell fate, identifying molecular hallmarks of potency—a cell’s ability to differentiate into other cell types—remains a challenge. Here we introduce CytoTRACE 2, an interpretable deep learning framework for predicting absolute developmental potential from single-cell RNA sequencing data. Across diverse platforms and tissues, CytoTRACE 2 outperformed previous methods in predicting developmental hierarchies, enabling detailed mapping of single-cell differentiation landscapes and expanding insights into cell potency. Cancer genomics Machine learning Software Stem cells Transcriptomics
N Nature Methods · Oct 22, 2025 scooby: modeling multimodal genomic profiles from DNA sequence at single-cell resolution Understanding how regulatory sequences shape gene expression across individual cells is a fundamental challenge in genomics. Joint RNA sequencing and epigenomic profiling provides opportunities to build models capturing sequence determinants across steps of gene expression. However, current models, developed primarily for bulk omics data, fail to capture the cellular heterogeneity and dynamic processes revealed by single-cell multimodal technologies. Here, we introduce scooby, a framework to model genomic profiles of single-cell RNA-sequencing coverage and single-cell assay for transposase-accessible chromatin using sequencing insertions from sequence at single-cell resolution. For this, we leverage the pretrained multiomics profile predictor Borzoi and equip it with a cell-specific decoder. Scooby recapitulates cell-specific expression levels of held-out genes and identifies regulators and their putative target genes. Moreover, scooby allows resolving single-cell effects of bulk expression quantitative trait loci and delineating their impact on chromatin accessibility and gene expression. We anticipate scooby to aid unraveling the complexities of gene regulation at the resolution of individual cells. Computational models Machine learning Software Transcriptomics
N Nature Methods · Oct 20, 2025 CELLECT: contrastive embedding learning for large-scale efficient cell tracking Quantitative analysis of large-scale cellular behaviors plays an increasingly crucial role in understanding mechanisms of diverse physiopathological processes, but achieving cell tracking with both high performance and efficiency in practical applications remains a challenge. Here we introduce CELLECT, a contrastive embedding learning method for large-scale efficient cell tracking, and demonstrate it on the Caenorhabditis elegans dataset in the Cell Tracking Challenge. By contrastive learning of latent embeddings of diverse cellular structures, a CELLECT model pretrained on a single public dataset can be effectively applied across different imaging modalities and species with broad generalization. Using advanced two-photon imaging, CELLECT enables real-time 3D tracking of large-scale B cells with frequent divisions during germinal center formation in a mouse lymph node, quantitative identification of cell–bacterium interactions in the mouse spleen and high-fidelity extraction of neural signals during strong nonrigid motions. We believe that these results demonstrate broad applications of CELLECT in immunology, pathology and neuroscience. Fluorescence imaging Lymphocytes Software Systems biology
N Nature Genetics · Oct 17, 2025 Genotyping sequence-resolved copy number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes Copy number variable (CNV) genes are important in evolution and disease, yet their sequence variation remains a blind spot in large-scale studies. We present ctyper, a method that leverages pangenomes to produce allele-specific copy numbers with locally phased variants from next-generation sequencing samples. Benchmarking on 3,351 CNV genes and 212 challenging medically relevant (CMR) genes, ctyper captures 96.5% of phased variants with ≥99.1% correctness of copy number in CNV genes and 94.8% of phased variants in CMR genes. Ctyper takes 1.5 h to genotype a genome on one CPU. The ctyper genotypes give a 4.81-fold improvement in predictions of gene expression compared to known expression quantitative trait locus (eQTL) variants. Allele-specific expression quantified divergent expression in 7.94% of paralogs and tissue-specific biases in 4.68%. We found reduced expression of SMN2 due to SMN1 conversion, potentially affecting spinal muscular atrophy, and increased expression of translocated duplications of AMY2B. Overall, ctyper enables biobank-scale genotyping of CNV and CMR genes. Gene expression Genomics Sequence annotation Software
N Nature Methods · Oct 15, 2025 gReLU: a comprehensive framework for DNA sequence modeling and design Deep learning models trained on DNA sequences can predict cell-type-specific regulatory activity, reveal cis-regulatory grammar, prioritize genetic variants and design synthetic DNA. However, building and interpreting these models correctly remains difficult, and models and software built by different groups are often not interoperable. Here we present gReLU, a comprehensive software framework that enables advanced sequence modeling pipelines, including data preprocessing, modeling, evaluation, interpretation, variant effect prediction and regulatory element design. gReLU advances deep-learning-based modeling and analysis of DNA sequences with comprehensive toolsets and versatile applications. Genomics Machine learning Software
N Nature Biotechnology · Oct 15, 2025 Predicting functions of uncharacterized gene products from microbial communities The majority of genes in microbial communities remain uncharacterized. Here we develop a method to infer putative function for microbial proteins at scale by assessing community-wide multiomics data. We predict high-confidence functions for >443,000 protein families (~82.3% previously uncharacterized), including >27,000 protein families with weak homology to known proteins and >6,000 protein families without homology. These were drawn from 1,595 gut metagenomes and 800 metatranscriptomes from the Integrative Human Microbiome Project (HMP2/iHMP). Integrating additional information such as sequence similarity, genomic proximity and domain–domain interactions improves performance of the method. Our method’s implementation, FUGAsseM, is generalizable and predicts protein function in both well-studied and undercharacterized communities. FUGAsseM achieves similar levels of accuracy in the context of microbial communities when compared to state-of-the-art approaches designed for application to single organisms while simultaneously providing much greater breadth of coverage. This initial study expands the functional landscape of the human gut microbiome and allows for exploration of microbial proteins in undercharacterized communities. Data integration Gene expression Microbiome Protein function predictions Software
N Nature Methods · Oct 13, 2025 Multitask benchmarking of single-cell multimodal omics integration methods Single-cell multimodal omics technologies have empowered the profiling of complex biological systems at a resolution and scale that were previously unattainable. These biotechnologies have propelled the fast-paced innovation and development of data integration methods, leading to a critical need for their systematic categorization, evaluation and benchmarking. Navigating and selecting the most pertinent integration approach poses a considerable challenge, contingent upon the tasks relevant to the study goals and the combination of modalities and batches present in the data at hand. Understanding how well each method performs multiple tasks, including dimension reduction, batch correction, cell type classification and clustering, imputation, feature selection and spatial registration, and at which combinations will help guide this decision. Here we develop a much-needed guideline on choosing the most appropriate method for single-cell multimodal omics data analysis through a systematic categorization and comprehensive benchmarking of current methods. The stage 1 protocol for this Registered Report was accepted in principle on 30 July 2024. The protocol, as accepted by the journal, can be found at https://springernature.figshare.com/articles/journal_contribution/Multi-task_benchmarking_of_single-cell_multimodal_omics_integration_methods/26789902. Computational models Data integration Software Transcriptomics
N Nature Methods · Oct 13, 2025 Deep generative modeling of sample-level heterogeneity in single-cell genomics Single-cell genomic studies were recently conducted on hundreds of samples exhibiting complex designs. These data have tremendous potential for discovering how sample- or tissue-level phenotypes relate to cellular and molecular composition. However, current analyses are often based on simplified representations of these data by averaging information across cells. We present multi-resolution variational inference (MrVI), a deep generative model designed to realize the potential of cohort studies at the single-cell level. MrVI tackles two fundamental, intertwined problems: stratifying samples into groups and evaluating the cellular and molecular differences between groups, without requiring predefined cell states. Leveraging its single-cell perspective, MrVI detects clinically relevant stratifications of cohorts of people with COVID-19 or inflammatory bowel disease that are manifested in only certain cellular subsets, enabling new discoveries that would otherwise be overlooked. MrVI can de novo identify groups of small molecules with similar biochemical properties and evaluate their effects on cellular composition and gene expression in large-scale perturbation studies. MrVI is an open-source tool at scvi-tools.org. Machine learning Software Statistical methods Transcriptomics
N Nature Methods · Oct 08, 2025 Automated classification of cellular expression in multiplexed imaging data with Nimbus Multiplexed imaging offers a powerful approach to characterize the spatial topography of tissues in both health and disease. To analyze such data, the specific combination of markers that are present in each cell must be enumerated to enable accurate phenotyping, a process that often relies on unsupervised clustering. We constructed the Pan-Multiplex (Pan-M) dataset containing 197 million distinct annotations of marker expression across 15 different cell types. We used Pan-M to create Nimbus, a deep learning model to predict marker positivity from multiplexed image data. Nimbus is a pretrained model that uses the underlying images to classify marker expression of individual cells as positive or negative across distinct cell types, from different tissues, acquired using different microscope platforms, without requiring any retraining. We demonstrate that Nimbus predictions capture the underlying staining patterns of the full diversity of markers present in Pan-M, and that Nimbus matches or exceeds the accuracy of previous approaches that must be retrained on each dataset. We then show how Nimbus predictions can be integrated with downstream clustering algorithms to robustly identify cell subtypes in image data. We have open-sourced Nimbus and Pan-M to enable community use at https://github.com/angelolab/Nimbus-Inference. Image processing Machine learning Software
N Nature Methods · Oct 08, 2025 Cell tracking with accurate error prediction Cell tracking is an indispensable tool for studying development by time-lapse imaging. However, existing cell trackers cannot assign confidence to predicted tracks, which prohibits fully automated analysis without manual curation. We present a fundamental advance: an algorithm that combines neural networks with statistical physics to determine cell tracks with error probabilities for each step in the track. From these, we can obtain error probabilities for any tracking feature, from cell cycles to lineage trees, that function like P values in data interpretation. Our method, OrganoidTracker 2.0, greatly speeds up tracking analysis by limiting manual curation to rare low-confidence tracking steps. Importantly, it also enables fully automated analysis by retaining only high-confidence track segments, which we demonstrate by analyzing cell cycles and differentiation events at scale for thousands of cells in multiple intestinal organoids. Our approach brings cell dynamics-based organoid screening within reach and enables transparent reporting of cell-tracking results and associated scientific claims. Confocal microscopy Differentiation Image processing Software Statistical methods
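How per-step error probabilities might roll up into a feature-level confidence, and how low-confidence steps might be used to cut a track into retained segments, can be sketched as follows. This is an illustration only, not OrganoidTracker 2.0's algorithm: it assumes step errors are independent, and the threshold `max_step_error` and both function names are hypothetical.

```python
from math import prod


def segment_confidence(step_error_probs):
    """Probability that every step in a segment is correct,
    assuming (for illustration) that step errors are independent."""
    return prod(1.0 - p for p in step_error_probs)


def high_confidence_segments(step_error_probs, max_step_error=0.01):
    """Cut a track at every step whose error probability exceeds the
    threshold and return the surviving runs as lists of step indices."""
    segments, current = [], []
    for index, p in enumerate(step_error_probs):
        if p <= max_step_error:
            current.append(index)
        else:
            if current:
                segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# Toy track with one unreliable link at step 2.
errors = [0.001, 0.002, 0.30, 0.005]
print(high_confidence_segments(errors))
print(round(segment_confidence(errors[:2]), 6))
```

Under this sketch, manual curation would only need to look at the steps that were cut out (step 2 here).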
N Nature Methods · Oct 01, 2025 Giotto Suite: a multiscale and technology-agnostic spatial multiomics analysis ecosystem Emerging spatial multiomics technologies provide an increasingly large amount of information content at multiple scales. However, it remains challenging to efficiently represent and harmonize diverse spatial datasets. Here we present Giotto Suite, a suite of modular packages that provides scalable and extensible end-to-end solutions for multiscale and multiomic data analysis, integration and visualization. At its core, Giotto Suite is centered around an innovative data framework, allowing the representation and integration of spatial omics data in a technology-agnostic manner. Giotto Suite integrates molecular, morphology, spatial and annotated feature information to create a responsive and flexible workflow, as demonstrated by applications to several state-of-the-art spatial technologies. Furthermore, Giotto Suite builds upon interoperable interfaces and data structures that bridge the established fields of genomics and spatial data science in R, thereby enabling independent developers to create custom-engineered pipelines. As such, Giotto Suite creates an immersive and multiscale ecosystem for spatial multiomic data analysis. Computational platforms and environments Software Transcriptomics
N Nature Methods · Sep 29, 2025 InterPLM: discovering interpretable features in protein language models via sparse autoencoders Despite their success in protein modeling and design, the internal mechanisms of protein language models (PLMs) are poorly understood. Here we present a systematic framework to extract and analyze interpretable features from PLMs using sparse autoencoders. Training sparse autoencoders on ESM-2 embeddings, we identify thousands of interpretable features highlighting biological concepts including binding sites, structural motifs and functional domains. Individual neurons show considerably less conceptual alignment, suggesting PLMs store concepts in superposition. This superposition persists across model scales and larger PLMs capture more interpretable concepts. Beyond known annotations, ESM-2 learns coherent patterns across evolutionarily distinct protein families. To systematically analyze these numerous features, we developed an automated interpretation approach using large language models for feature description and validation. As practical applications, these features can accurately identify missing database annotations and enable targeted steering of sequence generation. Our results show PLM representations can be decomposed into interpretable components, demonstrating the feasibility and utility of mechanistically interpreting these models. Protein analysis Software
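For readers unfamiliar with sparse autoencoders, the sketch below shows the general shape of the technique on a single embedding vector: an overcomplete ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. It is a generic illustration in plain Python, not the InterPLM code; the dimensions, initialization, `l1_coeff` value and the L1 form of the penalty are assumptions, and no training loop is shown.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Overcomplete autoencoder whose hidden units act as candidate features."""

    def __init__(self, d_in, d_hidden, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_in)]
                      for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)]
                      for _ in range(d_in)]
        self.b_dec = [0.0] * d_in

    def encode(self, x):
        """ReLU feature activations; most should be zero after training."""
        pre = matvec(self.w_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        """Linear reconstruction of the input embedding."""
        out = matvec(self.w_dec, features)
        return [o + b for o, b in zip(out, self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        """Mean squared reconstruction error plus an L1 sparsity penalty."""
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        return mse + l1_coeff * sum(abs(f) for f in features)


# Toy "embedding" of width 4 expanded into 16 candidate features.
sae = SparseAutoencoder(d_in=4, d_hidden=16)
embedding = [0.5, -1.0, 0.25, 2.0]
print(len(sae.encode(embedding)), sae.loss(embedding) >= 0.0)
```

In practice the input would be a per-residue language-model embedding and the weights would be fitted by gradient descent on this loss.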
N Nature Methods · Sep 25, 2025 Merging conformational landscapes in a single consensus space with FlexConsensus algorithm Structural heterogeneity analysis in cryogenic electron microscopy is experiencing a breakthrough in estimating more accurate, richer and interpretable conformational landscapes derived from experimental data. The emergence of new methods designed to tackle the heterogeneity challenge reflects this new paradigm, enabling users to gain a better understanding of protein dynamics. However, the question of how intrinsically different heterogeneity algorithms compare remains unsolved, which is crucial for determining the reliability, stability and correctness of the estimated conformational landscapes. Here, to address this challenge, we introduce FlexConsensus: a multi-autoencoder neural network able to learn the commonalities and differences among several conformational landscapes, enabling them to be placed in a shared consensus space with enhanced reliability. The consensus space enables the measurement of reproducibility in heterogeneity estimations, allowing users to either focus their analysis on particles with a stable estimation of their structural variability or concentrate on specific particle subsets detected by only certain methods. Image processing Machine learning Software
N Nature Methods · Sep 25, 2025 EpiAgent: foundation model for single-cell epigenomics Although single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) enables the exploration of the epigenomic landscape that governs transcription at the cellular level, the complicated characteristics of the sequencing data and the broad scope of downstream tasks mean that a sophisticated and versatile computational method is urgently needed. Here we introduce EpiAgent, a foundation model pretrained on our manually curated large-scale Human-scATAC-Corpus. EpiAgent encodes chromatin accessibility patterns of cells as concise ‘cell sentences’ and captures cellular heterogeneity behind regulatory networks via bidirectional attention. Comprehensive benchmarks show that EpiAgent excels in typical downstream tasks, including unsupervised feature extraction, supervised cell type annotation and data imputation. By incorporating external embeddings, EpiAgent enables effective cellular response prediction for both out-of-sample stimulated and unseen genetic perturbations, reference data integration and query data mapping. Through in silico knockout of cis-regulatory elements, EpiAgent demonstrates the potential to model cell state changes. EpiAgent is further extended to directly annotate cell types in a zero-shot manner. Computational models Data integration Machine learning Software
N Nature Microbiology · Sep 19, 2025 Phages with a broad host range are common across ecosystems Phages are diverse and abundant within microbial communities, where they play major roles in their evolution and adaptation. Phage replication, and multiplication, is generally thought to be restricted within a single or narrow host range. Here we use published and newly generated proximity-ligation-based metagenomic Hi-C (metaHiC) data from various environments to explore virus–host interactions. We reconstructed 4,975 microbial and 6,572 phage genomes of medium quality or higher. MetaHiC yielded a contact network between genomes and enabled assignment of approximately half of phage genomes to their hosts, revealing that a substantial proportion of these phages interact with multiple species in environments as diverse as the oceanic water column or the human gut. This observation challenges the traditional view of a narrow host spectrum of phages by unveiling that multihost associations are common across ecosystems, with implications for how they might impact ecology and evolution and phage therapy approaches. Environmental microbiology Software Virology
N Nature Methods · Sep 18, 2025 GPU-accelerated homology search with MMseqs2 Rapidly growing protein databases demand faster sensitive search tools. Here the graphics processing unit (GPU)-accelerated MMseqs2 delivers 6× faster single-protein searches than CPU methods on 2 × 64 cores, speeds previously requiring large protein batches. For larger query batches, it is the most cost-effective solution, outperforming the fastest alternative method by 2.4-fold with eight GPUs. It accelerates protein structure prediction with ColabFold 31.8× over the standard AlphaFold2 pipeline and protein structure search with Foldseek by 4–27×. MMseqs2-GPU is available under an open-source license at https://mmseqs.com/. Hardware and infrastructure Protein analysis Protein function predictions Protein structure predictions Software
N Nature Methods · Sep 15, 2025 Cancer subclone detection based on DNA copy number in single-cell and spatial omic sequencing data Somatic mutations such as copy number alterations accumulate during cancer progression, driving intratumor heterogeneity that impacts therapy effectiveness. Understanding the characteristics and spatial distribution of genetically distinct subclones is essential for unraveling tumor evolution and improving cancer treatment. Here we present Clonalscope, a subclone detection method using copy number profiles, applicable to spatial transcriptomics and single-cell sequencing data. Clonalscope implements a nested Chinese Restaurant Process to identify de novo tumor subclones, which can incorporate prior information from matched bulk DNA sequencing data for improved subclone detection and malignant cell labeling. On single-cell RNA sequencing and single-cell assay for transposase-accessible chromatin using sequencing data from gastrointestinal tumors, Clonalscope successfully labeled malignant cells and identified genetically different subclones with thorough validations. On spatial transcriptomics data from various primary and metastasized tumors, Clonalscope labeled malignant spots, traced subclones and identified spatially segregated subclones with distinct differentiation levels and expression of genes associated with drug resistance and survival. Cancer genomics Genomics Software Statistical methods Tumour heterogeneity
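The Chinese Restaurant Process named in the Clonalscope entry above is a standard prior over partitions in which the number of clusters is not fixed in advance. The following is a minimal sketch of the plain (non-nested) process only, not Clonalscope's nested variant or its implementation; the function name `crp_assignments` and the parameter `alpha` (concentration) are illustrative assumptions.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels from a plain Chinese Restaurant Process.

    Hypothetical helper for illustration; alpha > 0 is the concentration
    parameter (larger values tend to open more clusters).
    """
    rng = random.Random(seed)
    assignments = []   # cluster label for each item, in arrival order
    table_sizes = []   # number of items currently in each cluster
    for _ in range(n_items):
        # An existing cluster k is chosen with weight equal to its size;
        # a new cluster is opened with weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new cluster
        else:
            table_sizes[k] += 1        # join an existing cluster
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    labels, sizes = crp_assignments(20, alpha=1.0, seed=1)
    print(labels)
    print(sizes)
```

Because larger clusters attract new items more strongly, the process tends to yield a few large clusters and several small ones, which is one reason it is a common choice when the number of subclones is unknown beforehand.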
N Nature Methods · Sep 15, 2025 De novo discovery of conserved gene clusters in microbial genomes with Spacedust Metagenomics has revolutionized environmental and human-associated microbiome studies. However, the limited fraction of proteins with known biological processes and molecular functions presents a major bottleneck. In prokaryotes and viruses, evolution favors keeping genes participating in the same biological processes colocalized as conserved gene clusters. Conversely, conservation of gene neighborhood indicates functional association. Here we present Spacedust, a tool for systematic, de novo discovery of conserved gene clusters. To find homologous protein matches, Spacedust uses fast and sensitive structure comparison with Foldseek. Partially conserved clusters are detected using novel clustering and order conservation P values. We demonstrate Spacedust’s sensitivity with an all-versus-all analysis of 1,308 bacterial genomes, identifying 72,843 conserved gene clusters containing 58% of the 4.2 million genes. It recovered 95% of antiviral defense system clusters annotated by the specialized tool PADLOC. Spacedust’s high sensitivity and speed will facilitate the annotation of large numbers of sequenced bacterial, archaeal and viral genomes. Genome informatics Metagenomics Software
N Nature Genetics · Sep 15, 2025 Accelerated Bayesian inference of population size history from recombining sequence data This study introduces population history learning by averaging sampled histories (PHLASH), a new method for inferring population history from whole-genome sequence data. It works by drawing random, low-dimensional projections of the coalescent intensity function from the posterior distribution of a pairwise sequentially Markovian coalescent-like model and averaging them together to form an accurate and adaptive estimator. On simulated data, PHLASH tends to be faster and have lower error than several competing methods, including SMC++, MSMC2 and FITCOAL. Moreover, it provides automatic uncertainty quantification and leads to new Bayesian testing procedures for detecting population structure and ancient bottlenecks. The key technical advance is a new algorithm for computing the score function (gradient of the log likelihood) of a coalescent hidden Markov model, which has the same computational cost as evaluating the log likelihood. PHLASH has been released as an easy-to-use Python software package and leverages graphics processing unit acceleration when available. Population genetics Software
N Nature Biotechnology · Sep 10, 2025 Efficient sequence alignment against millions of prokaryotic genomes with LexicMap The size of microbial sequence databases continues to grow beyond the abilities of existing alignment tools. We introduce LexicMap, a nucleotide sequence alignment tool for efficiently querying moderate-length sequences (>250 bp) such as a gene, plasmid or long read against up to millions of prokaryotic genomes. We construct a small set of probe k-mers, which are selected to efficiently sample the entire database to be indexed such that every 250-bp window of each database genome contains multiple seed k-mers, each with a shared prefix with one of the probes. Storing these seeds in a hierarchical index enables fast and low-memory alignment. We benchmark both accuracy and potential to scale to databases of millions of bacterial genomes, showing that LexicMap achieves comparable accuracy to state-of-the-art methods but with greater speed and lower memory use. Our method supports querying at scale and within minutes, which will be useful for many biological applications across epidemiology, ecology and evolution. Bacterial genomics Computational models Genetic databases Genome informatics Software
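The coverage property described in the LexicMap entry above (every window of a genome contains multiple seed k-mers that share a prefix with a probe) can be checked on a toy input as follows. This is a sketch of that property alone, under stated assumptions: the function names, the fixed prefix length and the tiny example sequence are all hypothetical, and LexicMap's actual probe selection, seed choice and hierarchical index are not reproduced here.

```python
def kmers_sharing_probe_prefix(window, probes, k, prefix_len):
    """Return the k-mers in `window` whose first `prefix_len` bases
    match the first `prefix_len` bases of at least one probe."""
    prefixes = {p[:prefix_len] for p in probes}
    return [
        window[i:i + k]
        for i in range(len(window) - k + 1)
        if window[i:i + prefix_len] in prefixes
    ]


def every_window_seeded(genome, probes, k, prefix_len, window_len, min_seeds):
    """Check that each window of length `window_len` holds at least
    `min_seeds` candidate seed k-mers (toy check, not LexicMap's code)."""
    n_windows = max(1, len(genome) - window_len + 1)
    for start in range(n_windows):
        window = genome[start:start + window_len]
        seeds = kmers_sharing_probe_prefix(window, probes, k, prefix_len)
        if len(seeds) < min_seeds:
            return False
    return True


if __name__ == "__main__":
    genome = "ACGTACGTACGT"   # toy sequence, far shorter than a real genome
    probes = ["ACGA"]         # hypothetical probe k-mer
    print(every_window_seeded(genome, probes, k=4, prefix_len=2,
                              window_len=8, min_seeds=1))
    print(every_window_seeded(genome, probes, k=4, prefix_len=2,
                              window_len=8, min_seeds=2))
```

In the entry's setting the window length would be 250 bp; the short window here only keeps the example readable.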
N Nature Methods · Sep 08, 2025 Scvi-hub: an actionable repository for model-driven single-cell analysis The growing availability of single-cell omics datasets presents new opportunities for reuse, while challenges in data transfer, normalization and integration remain a barrier. Here we present scvi-hub: a platform for efficiently sharing and accessing single-cell omics datasets using pretrained probabilistic models. It enables immediate execution of fundamental tasks like visualization, imputation, annotation and deconvolution on new query datasets using state-of-the-art methods, with massively reduced storage and compute requirements. We show that pretrained models support efficient analysis of large references, including the CZI CELLxGENE Discover Census. Scvi-hub is built within the scvi-tools open-source environment and integrated into scverse. Scvi-hub offers a scalable and user-friendly framework for accessing and contributing to a growing ecosystem of ready-to-use models and datasets, thus putting the power of atlas-level analysis at the fingertips of a broad community of users. Machine learning Software Statistical methods Transcriptomics
N Nature Genetics · Sep 08, 2025 Robust and accurate Bayesian inference of genome-wide genealogies for hundreds of genomes The Ancestral Recombination Graph (ARG), which describes the genealogical history of a sample of genomes, is a vital tool in population genomics and biomedical research. Recent advancements have substantially increased ARG reconstruction scalability, but they rely on approximations that can reduce accuracy, especially under model misspecification. Moreover, they reconstruct only a single ARG topology and cannot quantify the considerable uncertainty associated with ARG inferences. Here, to address these challenges, we introduce SINGER (sampling and inferring of genealogies with recombination), a method that accelerates ARG sampling from the posterior distribution by two orders of magnitude, enabling accurate inference and uncertainty quantification for hundreds of whole-genome sequences. Through extensive simulations, we demonstrate SINGER’s enhanced accuracy and robustness to model misspecification compared to existing methods. We demonstrate the utility of SINGER by applying it to individuals of British and African descent within the 1000 Genomes Project, identifying signals of population differentiation, archaic introgression and strong support for ancient polymorphism in the human leukocyte antigen region shared across primates. Population genetics Software