Nature Biotechnology · Nov 11, 2025
Multimodal learning enables chat-based exploration of single-cell data
Single-cell sequencing characterizes biological samples at unprecedented scale and detail, but data interpretation remains challenging. Here, we present CellWhisperer, an artificial intelligence (AI) model and software tool for chat-based interrogation of gene expression. We establish a multimodal embedding of transcriptomes and their textual annotations, using contrastive learning on 1 million RNA sequencing profiles with AI-curated descriptions. This embedding informs a large language model that answers user-provided questions about cells and genes in natural-language chats. We benchmark CellWhisperer’s performance for zero-shot prediction of cell types and other biological annotations and demonstrate its use for biological discovery in a meta-analysis of human embryonic development. We integrate a CellWhisperer chat box with the CELLxGENE browser, allowing users to interactively explore gene expression through a combined graphical and chat interface. In summary, CellWhisperer leverages large community-scale data repositories to connect transcriptomes and text, thereby enabling interactive exploration of single-cell RNA-sequencing data with natural-language chats.
Gene regulation in immune cells · Machine learning · Preclinical research · Software · Transcriptomics · biology
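
The abstract above mentions contrastive learning over paired transcriptomes and text but does not state the loss. As a rough illustration only, the sketch below assumes a CLIP-style symmetric objective over a batch of paired embeddings. The function names, the temperature value and the use of NumPy are all assumptions made for this example and are not taken from CellWhisperer.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(z, axis):
    # Numerically stable log-softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(tx_emb, text_emb, temperature=0.07):
    """Hypothetical CLIP-style loss: row i of tx_emb is paired with row i of text_emb.

    tx_emb and text_emb are (batch, dim) arrays from two separate encoders
    (assumed here; the encoders themselves are out of scope for this sketch).
    """
    t = l2_normalize(tx_emb)
    s = l2_normalize(text_emb)
    logits = t @ s.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # matching pairs sit on the diagonal
    loss_tx_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_tx = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_tx_to_text + loss_text_to_tx)


if __name__ == "__main__":
    emb = np.eye(4)
    # Correctly paired embeddings should score a lower loss than shifted pairs.
    print(symmetric_contrastive_loss(emb, emb))
    print(symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

Under this assumed objective, the loss is small when each transcriptome embedding is closest to its own description and large when the pairing is broken, which is the property a zero-shot annotation lookup would rely on.
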
Nature Biotechnology · Nov 04, 2025
KATMAP infers splicing factor activity and regulatory targets from knockdown data
Typical RNA sequencing (RNA-seq) experiments uncover hundreds of splicing changes, reflecting underlying changes in splicing factor (SF) activity. Understanding how SF activity influences transcriptomic variation requires elucidating how each SF impacts splicing. Here, we present an interpretable regression model, KATMAP, which models splicing changes throughout the transcriptome by analyzing changes in SF binding and the resulting alterations in RNA processing. To learn a regulatory model, KATMAP requires SF perturbation RNA-seq data and the SF’s binding motif as inputs, returning a description of the SF’s position-specific regulatory activity and predicted targets. The KATMAP software includes models pretrained on ENCODE SF knockdown data. Learned KATMAP models can be applied to predict SF regulation and cis-elements at individual exons, which can guide the design of splice-switching antisense oligonucleotides. KATMAP can also interpret RNA-seq data by uncovering the factors responsible for transcriptomic changes, distinguishing direct SF targets from indirect effects and inferring relevant SFs from clinical RNA-seq data.
Computational models · Software · Transcriptomics · biology
Nature Biotechnology · Oct 15, 2025
Predicting functions of uncharacterized gene products from microbial communities
The majority of genes in microbial communities remain uncharacterized. Here we develop a method to infer putative function for microbial proteins at scale by assessing community-wide multiomics data. We predict high-confidence functions for >443,000 protein families (~82.3% previously uncharacterized), including >27,000 protein families with weak homology to known proteins and >6,000 protein families without homology. These were drawn from 1,595 gut metagenomes and 800 metatranscriptomes from the Integrative Human Microbiome Project (HMP2/iHMP). Integrating additional information such as sequence similarity, genomic proximity and domain–domain interactions improves performance of the method. Our method’s implementation, FUGAsseM, is generalizable and predicts protein function in both well-studied and undercharacterized communities. FUGAsseM achieves similar levels of accuracy in the context of microbial communities when compared to state-of-the-art approaches designed for application to single organisms while simultaneously providing much greater breadth of coverage. This initial study expands the functional landscape of the human gut microbiome and allows for exploration of microbial proteins in undercharacterized communities.
Data integration · Gene expression · Microbiome · Protein function predictions · Software · Microbiology · Genomics · Machine Learning · Human
Nature Biotechnology · Sep 10, 2025
Efficient sequence alignment against millions of prokaryotic genomes with LexicMap
The size of microbial sequence databases continues to grow beyond the abilities of existing alignment tools. We introduce LexicMap, a nucleotide sequence alignment tool for efficiently querying moderate-length sequences (>250 bp) such as a gene, plasmid or long read against up to millions of prokaryotic genomes. We construct a small set of probe k-mers, which are selected to efficiently sample the entire database to be indexed such that every 250-bp window of each database genome contains multiple seed k-mers, each with a shared prefix with one of the probes. Storing these seeds in a hierarchical index enables fast and low-memory alignment. We benchmark both accuracy and potential to scale to databases of millions of bacterial genomes, showing that LexicMap achieves comparable accuracy to state-of-the-art methods but with greater speed and lower memory use. Our method supports querying at scale and within minutes, which will be useful for many biological applications across epidemiology, ecology and evolution.
Bacterial genomics · Computational models · Genetic databases · Genome informatics · Software · Genomics · Machine Learning · Microbiology
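
The idea of seed k-mers that share a prefix with a probe can be illustrated with a toy example. The sketch below is a simplification written for this listing and is not LexicMap's implementation: it assumes that, for each probe, the seed is the k-mer in the sequence with the longest common prefix with that probe, and it ignores windowing, reverse complements and the hierarchical index. All function names are hypothetical.

```python
def common_prefix_len(a, b):
    # Number of leading characters shared by strings a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def kmers(seq, k):
    # All (position, k-mer) pairs of the sequence, in order.
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def pick_seeds(seq, probes, k):
    """For each probe, keep the k-mer of seq sharing the longest prefix with it.

    Returns {probe: (position, kmer, shared_prefix_length)}. Ties go to the
    leftmost k-mer because max() returns the first maximal element.
    """
    candidates = kmers(seq, k)
    seeds = {}
    for probe in probes:
        pos, kmer = max(candidates, key=lambda pk: common_prefix_len(pk[1], probe))
        seeds[probe] = (pos, kmer, common_prefix_len(kmer, probe))
    return seeds


if __name__ == "__main__":
    # Toy sequence and probes, chosen only for illustration.
    for probe, seed in pick_seeds("ACGTACGGTTCA", ["ACGG", "GTAA"], k=4).items():
        print(probe, seed)
```

Because a query and a database genome that share a region tend to select the same seed for a given probe under this kind of scheme, seeds can be looked up by probe and prefix rather than by scanning every k-mer.
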