N Nature Methods · Dec 04, 2025 Deep Imputation for Skeleton data (DISK) for behavioral science Pose estimation methods and motion capture systems have opened doors to quantitative measurements of animal kinematics. While animal behavior experiments are expensive and complex, tracking errors sometimes make large portions of the experimental data unusable. Here our deep learning method, Deep Imputation for Skeleton data (DISK), uncovers dependencies between keypoints and their dynamics to impute missing tracking data without the help of any manual annotations. We demonstrate the utility and performance of DISK on seven animal skeletons including multi-animal setups. The imputed recordings allow us to detect more episodes of motion, such as steps, and obtain more statistically robust results when comparing these episodes between experimental conditions. In addition, by learning to impute the missing content, DISK learns meaningful representations of the data capturing, for example, underlying actions. This stand-alone imputation package, available at https://github.com/bozeklab/DISK.git/, is applicable to outputs of tracking methods (marker-based or markerless) and allows for varied types of downstream analysis. Computational neuroscience Machine learning biology
N Nature Methods · Nov 27, 2025 A comprehensive foundation model for cryo-EM image processing Cryogenic electron microscopy (cryo-EM) has become a premier technique for determining high-resolution structures of biological macromolecules. However, its broad application is constrained by the demand for specialized expertise. Here, to address this limitation, we introduce the Cryo-EM Image Evaluation Foundation (Cryo-IEF) model, a versatile tool pre-trained on ~65 million cryo-EM particle images through unsupervised learning. Cryo-IEF performs diverse cryo-EM processing tasks, including particle classification by structure, pose-based clustering and image quality assessment. Building on this foundation, we developed CryoWizard, a fully automated single-particle cryo-EM processing pipeline enabled by fine-tuned Cryo-IEF for efficient particle quality ranking. CryoWizard resolves high-resolution structures across samples of varied properties and effectively mitigates the prevalent challenge of preferred orientation in cryo-EM. Cryoelectron microscopy Machine learning Proteins biology
N Nature Methods · Nov 24, 2025 Helixer: ab initio prediction of primary eukaryotic gene models combining deep learning and a hidden Markov model The accurate identification of genes is vital for understanding biological function, yet this remains challenging across many newly sequenced or less-studied species. Here we present Helixer, an artificial intelligence-based tool for ab initio gene prediction that delivers highly accurate gene models across fungal, plant, vertebrate and invertebrate genomes. Unlike traditional methods, Helixer operates without requiring additional experimental data such as RNA sequencing, making it broadly applicable to diverse species. We show that Helixer’s pretrained models achieve accuracy on par with or exceeding current tools, producing gene annotations that closely match expert-curated references across multiple evaluation metrics. Its design enables immediate use on genomes without retraining, providing an efficient, accessible solution for genome annotation in both research and applied settings. The tool is available as open-source software for local installation via GitHub. It is also available through an online web interface and through the Galaxy ToolShed. Computational biology and bioinformatics Genome informatics Machine learning biology
N Nature Methods · Nov 18, 2025 ImmunoMatch learns and predicts cognate pairing of heavy and light immunoglobulin chains The development of stable antibodies formed by compatible heavy (H) and light (L) chain pairs is crucial in both in vivo maturation of antibody-producing cells and ex vivo designs of therapeutic antibodies. We present ImmunoMatch, a machine-learning framework trained on paired H and L sequences from human B cells to identify molecular features underlying chain compatibility. ImmunoMatch distinguishes cognate from random H–L pairs and captures differences associated with κ and λ light chains, reflecting B cell selection mechanisms in the bone marrow. We apply ImmunoMatch to reconstruct paired antibodies from spatial VDJ sequencing data and study the refinement of H–L pairing across B cell maturation stages in health and disease. We find further that ImmunoMatch is sensitive to sequence differences at the H–L interface. These insights provide a computational lens into the broader biological principles governing antibody assembly and stability. Adaptive immunity Lymphocytes Machine learning Software biology
N Nature Methods · Nov 11, 2025 Universal consensus 3D segmentation of cells from 2D segmented stacks Cell segmentation is the foundation of a wide range of microscopy-based biological studies. Deep learning has revolutionized two-dimensional (2D) cell segmentation, enabling generalized solutions across cell types and imaging modalities. This has been driven by the ease of scaling up image acquisition, annotation and computation. However, three-dimensional (3D) cell segmentation, requiring dense annotation of 2D slices, still poses substantial challenges. Manual labeling of 3D cells to train broadly applicable segmentation models is prohibitive. Even in high-contrast images annotation is ambiguous and time-consuming. Here we develop a theory and toolbox, u-Segment3D, for 2D-to-3D segmentation, compatible with any 2D method generating pixel-based instance cell masks. u-Segment3D translates and enhances 2D instance segmentations to a 3D consensus instance segmentation without training data, as demonstrated on 11 real-life datasets, comprising >70,000 cells, spanning single cells, cell aggregates and tissue. Moreover, u-Segment3D is competitive with native 3D segmentation, even exceeding when cells are crowded and have complex morphologies. Cellular imaging Image processing Machine learning Software biology
N Nature Methods · Nov 03, 2025 Squidiff: predicting cellular development and responses to perturbations using a diffusion model Single-cell sequencing has revolutionized our understanding of cellular heterogeneity and responses to environmental stimuli. However, mapping transcriptomic changes across diverse cell types in response to various stimuli and elucidating underlying disease mechanisms remains challenging. Here we present Squidiff, a diffusion model-based generative framework that predicts transcriptomic changes across diverse cell types in response to environmental changes. We demonstrate the robustness of Squidiff across cell differentiation, gene perturbation and drug response prediction. Through continuous denoising and semantic feature integration, Squidiff learns transient cell states and predicts high-resolution transcriptomic landscapes over time and conditions. Furthermore, we applied Squidiff to model blood vessel organoid development and cellular responses to neutron irradiation and growth factors. Our results demonstrate that Squidiff enables in silico screening of molecular landscapes and cellular state transitions, facilitating rapid hypothesis generation and providing valuable insights into the regulatory principles of cell fate decisions. Biotechnology Computational models Machine learning Stem-cell differentiation biology
N Nature Methods · Oct 30, 2025 Nicheformer: a foundation model for single-cell and spatial omics Tissue makeup depends on the local cellular microenvironment. Spatial single-cell genomics enables scalable and unbiased interrogation of these interactions. Here we introduce Nicheformer, a transformer-based foundation model trained on both human and mouse dissociated single-cell and targeted spatial transcriptomics data. Pretrained on SpatialCorpus-110M, a curated collection of over 57 million dissociated and 53 million spatially resolved cells across 73 tissues on cellular reconstruction, Nicheformer learns cell representations that capture spatial context. It excels in linear-probing and fine-tuning scenarios for a newly designed set of downstream tasks, in particular spatial composition prediction and spatial label prediction. Critically, we show that models trained only on dissociated data fail to recover the complexity of spatial microenvironments, underscoring the need for multiscale integration. Nicheformer enables the prediction of the spatial context of dissociated cells, allowing the transfer of rich spatial information to scRNA-seq datasets. Overall, Nicheformer sets the stage for the next generation of machine-learning models in spatial single-cell analysis. Computational models Machine learning Software Transcriptomics biology mouse experiments
N Nature Methods · Oct 29, 2025 Annotating the genome at single-nucleotide resolution with DNA foundation models Genome annotation models that directly analyze DNA sequences are indispensable for modern biological research, enabling rapid and accurate identification of genes and other functional elements. Current annotation tools are typically developed for specific element classes and trained from scratch using supervised learning on datasets that are often limited in size. Here we frame the genome annotation problem as multilabel semantic segmentation and introduce a methodology for fine-tuning pretrained DNA foundation models to segment 14 different genic and regulatory elements at single-nucleotide resolution. We leverage the self-supervised pretrained model Nucleotide Transformer to develop a general segmentation model, SegmentNT, capable of processing DNA sequences up to 50-kb long and that achieves state-of-the-art performance on gene annotation, splice site and regulatory elements detection. We also integrated in our framework the foundation models Enformer and Borzoi, extending the sequence context up to 500 kb and enhancing performance on regulatory elements. Finally, we show that a SegmentNT model trained on human genomic elements generalizes to different species, and a multispecies SegmentNT model achieves strong generalization across unseen species. Our approach is readily extensible to additional models, genomic elements and species. Genomics Machine learning Software biology
N Nature Methods · Oct 27, 2025 Improved reconstruction of single-cell developmental potential with CytoTRACE 2 While single-cell RNA sequencing has advanced our understanding of cell fate, identifying molecular hallmarks of potency—a cell’s ability to differentiate into other cell types—remains a challenge. Here we introduce CytoTRACE 2, an interpretable deep learning framework for predicting absolute developmental potential from single-cell RNA sequencing data. Across diverse platforms and tissues, CytoTRACE 2 outperformed previous methods in predicting developmental hierarchies, enabling detailed mapping of single-cell differentiation landscapes and expanding insights into cell potency. Cancer genomics Machine learning Software Stem cells Transcriptomics biology
N Nature Methods · Oct 22, 2025 scooby: modeling multimodal genomic profiles from DNA sequence at single-cell resolution Understanding how regulatory sequences shape gene expression across individual cells is a fundamental challenge in genomics. Joint RNA sequencing and epigenomic profiling provides opportunities to build models capturing sequence determinants across steps of gene expression. However, current models, developed primarily for bulk omics data, fail to capture the cellular heterogeneity and dynamic processes revealed by single-cell multimodal technologies. Here, we introduce scooby, a framework to model genomic profiles of single-cell RNA-sequencing coverage and single-cell assay for transposase-accessible chromatin using sequencing insertions from sequence at single-cell resolution. For this, we leverage the pretrained multiomics profile predictor Borzoi and equip it with a cell-specific decoder. Scooby recapitulates cell-specific expression levels of held-out genes and identifies regulators and their putative target genes. Moreover, scooby allows resolving single-cell effects of bulk expression quantitative trait loci and delineating their impact on chromatin accessibility and gene expression. We anticipate scooby to aid unraveling the complexities of gene regulation at the resolution of individual cells. Computational models Machine learning Software Transcriptomics biology
N Nature Methods · Oct 15, 2025 gReLU: a comprehensive framework for DNA sequence modeling and design Deep learning models trained on DNA sequences can predict cell-type-specific regulatory activity, reveal cis-regulatory grammar, prioritize genetic variants and design synthetic DNA. However, building and interpreting these models correctly remains difficult, and models and software built by different groups are often not interoperable. Here we present gReLU, a comprehensive software framework that enables advanced sequence modeling pipelines, including data preprocessing, modeling, evaluation, interpretation, variant effect prediction and regulatory element design. gReLU advances deep-learning-based modeling and analysis of DNA sequences with comprehensive toolsets and versatile applications. Genomics Machine learning Software Genetics Machine Learning Genomics Human
N Nature Methods · Oct 13, 2025 Deep generative modeling of sample-level heterogeneity in single-cell genomics Single-cell genomic studies were recently conducted on hundreds of samples exhibiting complex designs. These data have tremendous potential for discovering how sample- or tissue-level phenotypes relate to cellular and molecular composition. However, current analyses are often based on simplified representations of these data by averaging information across cells. We present multi-resolution variational inference (MrVI), a deep generative model designed to realize the potential of cohort studies at the single-cell level. MrVI tackles two fundamental, intertwined problems: stratifying samples into groups and evaluating the cellular and molecular differences between groups, without requiring predefined cell states. Leveraging its single-cell perspective, MrVI detects clinically relevant stratifications of cohorts of people with COVID-19 or inflammatory bowel disease that are manifested in only certain cellular subsets, enabling new discoveries that would otherwise be overlooked. MrVI can de novo identify groups of small molecules with similar biochemical properties and evaluate their effects on cellular composition and gene expression in large-scale perturbation studies. MrVI is an open-source tool at scvi-tools.org. Machine learning Software Statistical methods Transcriptomics biology
N Nature Methods · Oct 08, 2025 Automated classification of cellular expression in multiplexed imaging data with Nimbus Multiplexed imaging offers a powerful approach to characterize the spatial topography of tissues in both health and disease. To analyze such data, the specific combination of markers that are present in each cell must be enumerated to enable accurate phenotyping, a process that often relies on unsupervised clustering. We constructed the Pan-Multiplex (Pan-M) dataset containing 197 million distinct annotations of marker expression across 15 different cell types. We used Pan-M to create Nimbus, a deep learning model to predict marker positivity from multiplexed image data. Nimbus is a pretrained model that uses the underlying images to classify marker expression of individual cells as positive or negative across distinct cell types, from different tissues, acquired using different microscope platforms, without requiring any retraining. We demonstrate that Nimbus predictions capture the underlying staining patterns of the full diversity of markers present in Pan-M, and that Nimbus matches or exceeds the accuracy of previous approaches that must be retrained on each dataset. We then show how Nimbus predictions can be integrated with downstream clustering algorithms to robustly identify cell subtypes in image data. We have open-sourced Nimbus and Pan-M to enable community use at https://github.com/angelolab/Nimbus-Inference. Image processing Machine learning Software biology
N Nature Methods · Oct 03, 2025 All-at-once RNA folding with 3D motif prediction framed by evolutionary information Structural RNAs exhibit a vast array of recurrent short three-dimensional (3D) elements found in loop regions involving non-Watson–Crick interactions that help arrange canonical double helices into tertiary structures. Here we present CaCoFold-R3D, a probabilistic grammar that predicts these RNA 3D motifs (also termed modules) jointly with RNA secondary structure over a sequence or alignment. CaCoFold-R3D uses evolutionary information present in an RNA alignment to reliably identify canonical helices (including pseudoknots) by covariation. Here we further introduce the R3D grammars, which also exploit helix covariation that constrains the positioning of the mostly noncovarying RNA 3D motifs. Our method runs predictions over an almost-exhaustive list of over 50 known RNA motifs (‘everything’). Motifs can appear in any nonhelical loop region (including three-way, four-way and higher junctions) (‘everywhere’). All structural motifs as well as the canonical helices are arranged into one single structure predicted by one single joint probabilistic grammar (‘all-at-once’). Our results demonstrate that CaCoFold-R3D is a valid alternative for predicting the all-residue interactions present in an RNA 3D structure. CaCoFold-R3D is fast and easily customizable for novel motif discovery and shows promising value both as a strong input for deep learning approaches to all-atom structure prediction and as a guide for RNA design as drug targets for therapeutic small molecules. Computational models Machine learning Non-coding RNAs Riboswitches biology
N Nature Methods · Oct 02, 2025 Foundation model for efficient biological discovery in single-molecule time traces Single-molecule fluorescence microscopy (SMFM) can reveal important biological insights. However, uncovering rare but critical intermediates often demands manual inspection of time traces and iterative ad hoc approaches. To facilitate systematic and efficient discovery from SMFM time traces, we introduce META-SiM, a transformer-based foundation model pretrained on diverse SMFM analysis tasks. META-SiM rivals best-in-class algorithms on a broad range of tasks including trace classification, segmentation, idealization and stepwise photobleaching analysis. Additionally, the model produces embeddings that encapsulate detailed information about each trace, which the web-based META-SiM Projector (https://www.simol-projector.org) casts into lower-dimensional space for efficient whole-dataset visualization, labeling, comparison and sharing. Combining this Projector with the objective metric of local Shannon entropy enables rapid identification of condition-specific behaviors, even if rare or subtle. Applying META-SiM to an existing single-molecule Förster resonance energy transfer dataset, we discover a previously undetected intermediate state in pre-mRNA splicing. META-SiM removes bottlenecks, improves objectivity and both systematizes and accelerates biological discovery in single-molecule data. Machine learning Single-molecule biophysics Single-cell Machine Learning Structural Biology
N Nature Methods · Oct 01, 2025 Fourier-based three-dimensional multistage transformer for aberration correction in multicellular specimens High-resolution tissue imaging is often compromised by sample-induced optical aberrations that degrade resolution and contrast. Although wavefront sensor-based adaptive optics (AO) can measure these aberrations, such hardware solutions are typically complex, expensive to implement and slow when serially mapping spatially varying aberrations across large fields of view. Here we introduce AOViFT (adaptive optical vision Fourier transformer), a machine learning-based aberration sensing framework built around a three-dimensional multistage vision transformer that operates on Fourier domain embeddings. AOViFT infers aberrations and restores diffraction-limited performance in puncta-labeled specimens with substantially reduced computational cost, training time and memory footprint compared to conventional architectures or real-space networks. We validated AOViFT on live gene-edited zebrafish embryos, demonstrating its ability to correct spatially varying aberrations using either a deformable mirror or postacquisition deconvolution. By eliminating the need for the guide star and wavefront sensing hardware and simplifying the experimental workflow, AOViFT lowers technical barriers for high-resolution volumetric microscopy across diverse biological samples. Computational models Machine learning Machine Learning Zebrafish Cell Biology Microscopy
N Nature Methods · Sep 25, 2025 Merging conformational landscapes in a single consensus space with FlexConsensus algorithm Structural heterogeneity analysis in cryogenic electron microscopy is experiencing a breakthrough in estimating more accurate, richer and interpretable conformational landscapes derived from experimental data. The emergence of new methods designed to tackle the heterogeneity challenge reflects this new paradigm, enabling users to gain a better understanding of protein dynamics. However, the question of how intrinsically different heterogeneity algorithms compare remains unsolved, which is crucial for determining the reliability, stability and correctness of the estimated conformational landscapes. Here, to overcome the previous challenge, we introduce FlexConsensus: a multi-autoencoder neural network able to learn the commonalities and differences among several conformational landscapes, enabling them to be placed in a shared consensus space with enhanced reliability. The consensus space enables the measurement of reproducibility in heterogeneity estimations, allowing users to either focus their analysis on particles with a stable estimation of their structural variability or concentrate on specific particle subsets detected by only certain methods. Image processing Machine learning Software Structural Biology Cryo-EM Machine Learning
N Nature Methods · Sep 25, 2025 EpiAgent: foundation model for single-cell epigenomics Although single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) enables the exploration of the epigenomic landscape that governs transcription at the cellular level, the complicated characteristics of the sequencing data and the broad scope of downstream tasks mean that a sophisticated and versatile computational method is urgently needed. Here we introduce EpiAgent, a foundation model pretrained on our manually curated large-scale Human-scATAC-Corpus. EpiAgent encodes chromatin accessibility patterns of cells as concise ‘cell sentences’ and captures cellular heterogeneity behind regulatory networks via bidirectional attention. Comprehensive benchmarks show that EpiAgent excels in typical downstream tasks, including unsupervised feature extraction, supervised cell type annotation and data imputation. By incorporating external embeddings, EpiAgent enables effective cellular response prediction for both out-of-sample stimulated and unseen genetic perturbations, reference data integration and query data mapping. Through in silico knockout of cis-regulatory elements, EpiAgent demonstrates the potential to model cell state changes. EpiAgent is further extended to directly annotate cell types in a zero-shot manner. Computational models Data integration Machine learning Software Single-cell Machine Learning Genomics Human
N Nature Methods · Sep 15, 2025 Spatial gene expression at single-cell resolution from histology using deep learning with GHIST The increased use of spatially resolved transcriptomics provides new biological insights into disease mechanisms. However, the high cost and complexity of these methods are barriers to broader application. Consequently, methods have been created to predict spot-based gene expression from routinely collected histology images. Recent benchmarking showed that current methodologies have limited accuracy and spatial resolution, constraining translational capacity. Here, we introduce GHIST, a deep learning-based framework that predicts spatial gene expression at single-cell resolution by leveraging subcellular spatial transcriptomics and synergistic relationships between multiple layers of biological information. We validated GHIST using public datasets and The Cancer Genome Atlas data, demonstrating its flexibility across different spatial resolutions and superior performance. Our results underscore the utility of in silico generation of single-cell spatial gene expression measurements and the capacity to enrich existing datasets with a spatially resolved omics modality, paving the way for scalable multi-omics analysis and biomarker identification. Computational models Functional genomics Gene expression Image processing Machine learning Machine Learning Cancer Single-cell Genomics Human
N Nature Methods · Sep 15, 2025 Integrating diverse experimental information to assist protein complex structure prediction by GRASP Protein complex structure prediction is crucial for understanding of biological activities and advancing drug development. While various experimental methods can provide structural insights into protein complexes, the knowledge obtained is often sparse or approximate. A general tool is needed to integrate limited experimental information for high-throughput and accurate prediction. Here we introduce GRASP to efficiently and flexibly incorporate diverse forms of experimental information. GRASP outperforms existing tools in handling both simulated and real-world experimental restraints including those from crosslinking, covalent labeling, chemical shift perturbation and deep mutational scanning. For example, GRASP excels at predicting antigen–antibody complex structures, even surpassing AlphaFold3 when using experimental deep mutational scanning or covalent-labeling restraints. Beyond its accuracy and flexibility in restrained structure prediction, GRASP’s ability to integrate multiple forms of restraints enables integrative modeling. We also showcase its potential in modeling protein structural interactome under near-cellular conditions using previously reported large-scale in situ crosslinking data for mitochondria. Cryoelectron microscopy Machine learning Protein structure predictions Solution-state NMR Structural Biology Proteomics Machine Learning Drug Development
N Nature Methods · Sep 15, 2025 Scaling up spatial transcriptomics for large-sized tissues: uncovering cellular-level tissue architecture beyond conventional platforms with iSCALE Recent advances in spatial transcriptomics (ST) technologies have transformed our ability to profile gene expression while preserving crucial spatial context within tissues. However, existing ST platforms are constrained by high costs, long turnaround times, low resolution, limited gene coverage and inherently small tissue capture areas, which hinder their broad applications. Here we present iSCALE, a method that reconstructs large-scale, super-resolution gene expression landscapes and automatically annotates cellular-level tissue architecture in samples exceeding capture areas of current ST platforms. The performance of iSCALE was assessed by comprehensive evaluations involving benchmarking experiments, immunohistochemistry staining and manual annotations by pathologists. When applied to multiple sclerosis human brain samples, iSCALE uncovered lesion-associated cellular characteristics undetectable by conventional ST experiments. Our results demonstrate the utility of iSCALE in analyzing large tissues by enabling unbiased annotation, resolving cell type composition, mapping cellular microenvironments and revealing spatial features beyond the reach of standard ST analysis or routine histopathological assessment. Gene expression analysis Machine learning RNA sequencing Transcriptomics Neuroscience Single-cell Genomics Human Machine Learning
N Nature Methods · Sep 11, 2025 Biophysics-based protein language models for protein engineering Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose mutational effect transfer learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure and energetics. We fine-tune METL on experimental sequence–function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL’s ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering. Machine learning Protein design Machine Learning Structural Biology Proteomics
N Nature Methods · Sep 08, 2025 Scvi-hub: an actionable repository for model-driven single-cell analysis The growing availability of single-cell omics datasets presents new opportunities for reuse, while challenges in data transfer, normalization and integration remain a barrier. Here we present scvi-hub: a platform for efficiently sharing and accessing single-cell omics datasets using pretrained probabilistic models. It enables immediate execution of fundamental tasks like visualization, imputation, annotation and deconvolution on new query datasets using state-of-the-art methods, with massively reduced storage and compute requirements. We show that pretrained models support efficient analysis of large references, including the CZI CELLxGENE Discover Census. Scvi-hub is built within the scvi-tools open-source environment and integrated into scverse. Scvi-hub offers a scalable and user-friendly framework for accessing and contributing to a growing ecosystem of ready-to-use models and datasets, thus putting the power of atlas-level analysis at the fingertips of a broad community of users. Machine learning Software Statistical methods Transcriptomics Single-cell Machine Learning Genomics Human
N Nature Methods · Aug 26, 2025 A realistic phantom dataset for benchmarking cryo-ET data annotation Cryo-electron tomography (cryo-ET) is a powerful technique for imaging molecular complexes in their native cellular environments. However, identifying the vast majority of molecular species in cellular tomograms remains prohibitively difficult. Machine learning (ML) methods provide an opportunity to automate the annotation process, but algorithm development has been hindered by the lack of large, standardized datasets. Here we present an experimental phantom dataset with comprehensive ground-truth annotations for six molecular species to spur new algorithm development and benchmark existing tools. This annotated dataset is available on the CryoET Data Portal with infrastructure to streamline access for methods developers across fields. Cryoelectron tomography Data acquisition Machine learning Protein databases Proteins Cryo-EM Machine Learning Structural Biology
N Nature Methods · Aug 26, 2025 DeepMVP: deep learning models trained on high-quality data accurately predict PTM sites and variant-induced alterations Post-translational modifications (PTMs) are critical regulators of protein function, and their disruption is a key mechanism by which missense variants contribute to disease. Accurate PTM site prediction using deep learning can help identify PTM-altering variants, but progress has been limited by the lack of large, high-quality training datasets. Here, we introduce PTMAtlas, a curated compendium of 397,524 PTM sites generated through systematic reprocessing of 241 public mass-spectrometry datasets, and DeepMVP, a deep learning framework trained on PTMAtlas to predict PTM sites for phosphorylation, acetylation, methylation, sumoylation, ubiquitination and N-glycosylation. DeepMVP substantially outperforms existing tools across all six PTM types. Its application to predicting PTM-altering missense variants shows strong concordance with experimental results, validated using literature-curated variants and cancer proteogenomic datasets. Together, PTMAtlas and DeepMVP provide a robust platform for PTM research and a scalable framework for assessing the functional consequences of coding variants through the lens of PTMs. Genomics Machine learning Post-translational modifications Proteome informatics Proteomics Machine Learning Proteomics Genomics Human Cancer