Cell2fate infers RNA velocity modules to improve cell fate prediction

Cell2fate infers RNA velocity modules to improve cell fate prediction Cell2fate infers RNA velocity modules to improve cell fate prediction


The cell2fate model

Cell2fate builds on established concepts for RNA velocity1,2, employing a dynamical model to explain variation in spliced (s) and unspliced (u) read counts for individual genes and cells (Fig. 1a), which can be defined in two coupled ODEs:

$$\frac{d{u}_{g}}{{dt}}={\alpha }_{g}(t)-{\beta }_{g}{u}_{g}$$

(1)

$$\frac{d{s}_{g}}{{dt}}={\beta }_{g}{u}_{g}-{\gamma }_{g}{s}_{g}.$$

(2)

Fig. 1: Cell2fate model overview.
figure 1

ad, Cell2fate allows inferring complex and subtle transcriptional dynamics (a) by modeling gene-specific transcription rates (b) using a smaller number of independent modules with simple dynamics that also give rise to a modular structure in RNA velocity (c) and counts (d). λ denotes the rate of module activation (ON) or deactivation (OFF). The superscript ‘biol’ denotes that counts (u and s) do not include technical factors of variation, such as sequencing depth.

Here, α, β, γ denote the transcription, splicing and degradation rates for different genes g. Solving the ODEs for u and s and fitting the equations to observed counts allows estimation of the unknown rate parameters, which can then be substituted into equation (2) to obtain the rate of change in spliced counts in each cell, \(\frac{{ds}}{{dt}}\), which in turn gives rise to what is commonly referred to as ‘RNA velocity’ (ref. 1). An important challenge for RNA velocity models is that transcription rates as a function of differentiation time (\({\alpha }_{g}(t)\)) can be complex and nonlinear, reflecting the implicit dependency on active transcription factor (TF) abundance in the nucleus13, yet the integral of the transcription rate function \({\alpha }_{g}(t)\) needs to remain tractable to allow the ODEs to be solved efficiently. As a consequence, existing methods either assume simplified stepwise functions for αg(t) (refs. 1,2,5,7) or they resort to numerical approximations to solve the ODE6,8,9,10 (Supplementary Notes and Supplementary Fig. 3).

In cell2fate, we use an expansion of the derivative of the transcription rate in terms of individually integrable basis functions, which we refer to as linearization in the following:

$$\frac{d{\alpha }_{g}}{{dt}}=\mathop{\sum }\limits_{m=1}^{M}\frac{d{\alpha }_{{mg}}}{{dt}}=\mathop{\sum }\limits_{m=1}^{M}{\lambda }_{{mi}}({\hat{\alpha }}_{{mgi}}-{\alpha }_{{mg}}).$$

(3)

We term each basis function a module, denoted by a subscript m. Subscript i denotes the state of a module (ON or OFF). \({\hat{\alpha }}_{{mgi}}\) is the target transcription rate of a module, which takes on nonzero values for all genes, when the module is in the ON state (Fig. 1b, bottom right). \({\lambda }_{{mi}}\) is the rate at which the target transcription rate is reached, and the state i depends on switch times Tm,ON/Tm,OFF on a cell-specific timescale Tc (Fig. 1b, top). This choice of basis functions allows for each individual ODE as well as their total sum to be solved analytically (Methods and Supplementary Notes). The parameters Tm,ON/Tm,OFF, \({\lambda }_{{mi}}\) and Tc are shared across all genes, which vastly reduces the number of parameters that need to be estimated compared to existing models yet still provides gene-specific transcription rates \({\alpha }_{{mg}}\).

In addition to being appealing for computational reasons, the linearization also provides a biophysical connection between RNA velocity and statistical dimensionality reduction. This link becomes apparent when casting the linearization as a mixed membership model, in which transcription rates, RNA velocity and spliced and unspliced counts of each gene are governed by a linear combination of M modules (Methods and Fig. 1b). The mixing coefficients can then be interpreted analogously to gene loadings of factor analysis or principal-component analysis. Mechanistically, modules can be interpreted as approximating the transcription rate changes induced by all active regulatory proteins as a small set of independent effects that are each valid within time windows defined by Tm,ON/Tm,OFF.

Cell2fate is a fully Bayesian model that is fit to raw cell-level counts as input, and it includes a series of refinements to account for technical sources of variation, including overdispersion, variation in detection sensitivity of spliced and unspliced RNA molecules, ambient RNA and known batches (Methods). A hierarchical prior structure achieves efficient regularization of model parameters while sharing evidence strength across genes (Methods). The model is implemented in the probabilistic programming language Pyro14 and fit using Pyro’s stochastic variational inference framework using a customized variational distribution structure to account for dependencies between model parameters (Methods). The software implementation comes with guidelines and heuristics to determine hyperparameters such as the number of modules (Methods), and it builds on scvi-tools15 to facilitate its integration in existing workflows.

Improved predictions of cell fates and transcription rates

To benchmark the performance of cell2fate, we compared the model to existing RNA velocity methods by assessing the consistency of estimated cell fate trajectories with prior knowledge. Briefly, we considered the cross-boundary directional correctness (CBDir) metric to benchmark alternative models, thereby scoring the consistency of transition probabilities at the boundary between cell clusters with known transitions3 (Methods).

We considered ten RNA velocity methods spanning different model classes and approaches for parameter inference (Table 1 and Methods). All methods were applied to five scRNA-seq datasets, including widely used benchmark datasets such as the developing mouse dentate gyrus16 and pancreas17 (Supplementary Fig. 1). To test the ability of different methods to resolve complex transcriptional dynamics, we also examined mouse erythroid maturation18 and human bone marrow19: two datasets that feature multiple transcriptional boosts across cellular trajectories12. Finally, we considered a mouse bone marrow dataset with markedly low unique molecular identifier (UMI) counts1, thereby assessing the extent to which these models cope with low-coverage data.

Table 1 Summary of RNA velocity model versions and parameter settings used in benchmark

On average, across all five datasets, cell2fate achieved the best CBDir scores. More importantly, cell2fate inferred the correct directionality of cell fate transitions in all datasets, whereas all other methods, with the notable exception of pyroVelocity_model2, inferred reverse-order dynamics in at least one benchmark setting (corresponding to negative CBDir values). Inspecting the benchmarking results, we could attribute the performance of cell2fate to overcoming two major challenges as elaborated below.

First, cell2fate provided sufficient statistical power to identify correct velocity flows from subtle transcriptional dynamics. For example, in the mouse dentate gyrus dataset, other methods failed to resolve the late maturation trajectory of granule neurons, thus incorrectly inferring that mature cells transitioned into their immature counterparts (Fig. 2b, blue inset boxes and Supplementary Fig. 2). The same result was observed when applying CellRank20 to the velocity estimates instead of uniform manifold approximation and projection (UMAP), indicating that only cell2fate could resolve the trajectory toward mature granule neurons (Supplementary Figs. 3 and 4).

Fig. 2: Enhanced performance of cell2fate in benchmark of ten methods across five datasets.
figure 2

a, Performance of ten methods to reconstruct known trajectories on five datasets. Shown is the CBDir3, with positive values corresponding to correct lineage reconstructions. bd, Examples for datasets that harbor specific challenges, including weak signals in mature cell types (OPC, oligodendrocyte progenitor cell; OL, oligodendrocyte) (b) and complex transcription rate dynamics (c,d). d, Transcription rate as inferred for two selected genes that have been postulated to have stepwise changes in transcription rates18. e, Cell-specific time estimates from cell2fate. Left, astrocytes have much higher time than the neuron lineage. Right, one connected time range indicates a single lineage. f, CV of the posterior time can be used as an uncertainty measure to assess the suitability of a dataset for cell2fate analysis. In a steady-state dataset of peripheral blood mononuclear cells, this CV is close to 1 throughout. NK, natural killer; Treg, regulatory T cell.

Second, cell2fate correctly reconstructs complex transcriptional dynamics. In the mouse erythroid maturation datasets, the model resolved the correct cell trajectories, whereas other models tended to perform poorly (Fig. 2c and Supplementary Fig. 5). Previous analysis18, based on the visual inspection of spliced and unspliced counts across manually annotated cell clusters, has provided evidence that mouse erythroid lineage formation features many ‘multi-rate kinetic’ genes such as Hba-x and Nudt4 that display coordinated changes in transcription rates across the cell maturation trajectory18. Consistently, cell2fate recapitulated the stepwise transcriptional rate boosts in these multi-rate kinetic genes18 (Fig. 2d, turquoise line). By contrast, other methods, such as pyroVelocity_model2, can only predict a single nonzero transcription rate, due to their simpler underlying dynamical model (Fig. 2d, green line), results that we again confirmed using CellRank (Supplementary Figs. 6 and 7).

As an alternative metric to CBDir, we also assessed concordance of the estimated time differences between clusters with prior knowledge (Supplementary Table 8), and we compared time outputs to known developmental ages of different mouse samples (Supplementary Figs. 8 and 9); these alternative metrics and benchmarks confirmed the performance benefits of cell2fate. Notably, granule neuron subclusters with higher cell2fate estimated time contained more cells from the later developmental ages (Supplementary Fig. 8b), validating cell2fate estimates. Alternative methods as well as standard diffusion pseudotime analysis did not resolve this granule neuron maturation trajectory (Supplementary Figs. 8c and 10). Furthermore, cell2fate was robust to changes in the assumptions on prior distributions (Supplementary Fig. 11), and we confirmed the ability of the model to estimate ground truth dynamical rates when applying the model to semisynthetic data (Supplementary Fig. 12). Finally, we note that, while cell2fate has higher computational requirements than some existing methods, the computational requirements of the model are aligned with alternatives so that cell2fate can be readily applied to larger datasets (Supplementary Tables 6 and 7 and Supplementary Fig. 13).

We note that cell2fate’s cell-specific timescale3,6,7 (Fig. 1a,b) provides two additional use cases that can help to gain deeper insights. First, it aids the identification of cell lineage progression and distinct cell lineages. For example, in the mouse dentate gyrus dataset, granule neurons and astrocytes were assigned markedly disconnected time points, with oligodendrocytes occupying a mid-time point range (Fig. 2e, left), consistent with the distinct lineage origins of these three cell types16. By contrast, in the mouse erythroid maturation dataset, a single lineage with a single connected time range is identified (Fig. 2e, right). Second, inspecting the Bayesian posterior5,6,7 of the cell-specific time provides a principled measure of confidence in the RNA velocity values across and within datasets. In both datasets mentioned above, the coefficient of variation (CV) of the posterior distribution of individual cell times was consistently estimated to be close to zero, indicating low uncertainty (Fig. 2e, bottom). By contrast, cell2fate applied to a steady-state dataset of peripheral blood mononuclear cells21, in which no consistent transcriptional dynamics are expected12, results in confidence estimates with a CV close to 1, indicating high uncertainty (Fig. 2f). The posterior uncertainties can also be used to estimate confidence levels in individual transitions. To do so, cell2fate implements a heuristic confidence score based on the fraction of posterior samples from the cell-specific time in one cluster that are higher than the 90th percentile of samples from another cluster. This heuristic correlates well with the CBDir score of the corresponding transitions (ρ = 0.56), indicating that it is well calibrated (Supplementary Fig. 14). Hence, the posterior of cell-specific time can serve as quality control to assess whether cell2fate identifies meaningful dynamics in a given dataset.

In sum, our benchmark demonstrates cell2fate’s enhanced statistical power to estimate cell trajectories and resolve complex transcriptional dynamics and the ability to quantify uncertainty in velocity estimates.

RNA velocity modules map fine stages of late cell maturation

Cell2fate modules are sequentially activated gene expression programs over time. Given their biophysical foundations in transcriptional kinetics, we expect that RNA velocity modules can provide a more granular characterization of dynamic processes during cellular differentiation than conventional dimensionality reduction techniques that lack a mechanistic basis, such as matrix factorization or clustering. In addition, cell2fate comes with a suite of downstream analysis and visualization tools, enabling users to explore dynamic processes and derive biological insights.

To demonstrate the cell2fate toolkit, we considered the mouse brain single-cell dataset included as part of the benchmarking study (Fig. 2b), profiling the dentate gyrus region in the hippocampus across two developmental stages16. In addition to early differentiation of neurons and astrocytes from neural progenitors, this dataset covers the late maturation trajectory of granule neurons (that is, the late differentiation after the immature neuron stage), a critical process that is however not well understood, and, more generally, it is unknown whether this late maturation process unfolds across successive transcriptional stages. Previous RNA velocity methods applied to this dataset2,3 were able to distinguish neuronal versus astrocyte lineage trajectories; however, the correct trajectory for the most mature granule neurons could not be resolved (Fig. 2b).

Cell2fate applied to this dataset revealed 16 distinct RNA velocity modules (Supplementary Fig. 15), capturing all the expected cell trajectories, with the dominant lineage corresponding to granule neuron differentiation and maturation stemming from neural intermediate progenitor cells (nIPCs), neuroblasts and immature neurons. Radial glial-like progenitor cells are largely committed to astrocytes (Fig. 3a), as evident both from cell2fate’s time estimates and CellRank fate probabilities (Supplementary Fig. 16). We also observed that mossy cells, another neuronal population in the dentate gyrus, were assigned to the middle stages of the granule neuron trajectory. While mossy cells are thought to have different lineage origins, their transcriptional development is highly similar to that of granule neurons22.

Fig. 3: Module decomposition of cell2fate resolves final stages of granule neuron maturation.
figure 3

a, Cell2fate velocity graph embedding for dentate gyrus data. b, Spliced count abundance caused by selected modules over time in the granule neuron lineage. c, Activation of the selected example cell2fate module, weight of the selected MOFA23 factor and Leiden clustering for comparison. d, Modules in each cell can be classified into states for easier interpretation and visualization, based on how much of their maximal steady-state counts they have reached and whether they are increasing or decreasing in expression. e, Activation, defined as the spliced counts produced by a module in each cell. f, State of late granule neuron maturation modules. g, Module marker genes, defined as having a large part of their transcription rate explained by this module. h, Module TF genes, defined as module marker genes that are also known TFs.

To explore the dynamics of neuronal differentiation in greater depth, we used the fitted cell2fate model to estimate the total spliced transcript abundance for each of the nine granule neuron lineage modules in individual cells across the inferred time (Fig. 3d, top). This analysis identified the successive induction of modules across the early differentiation of radial glia into nIPCs, neuroblasts and immature neurons (modules 1–3). Strikingly, cell2fate also recovered dynamics in mature granule neurons, explained by six modules (modules 4–9) that are sequentially activated and temporally overlap across mature granule neurons, thereby finely dissecting the late maturation of these cells into distinct transcriptional windows (Fig. 3d). The model also correctly identified a temporal gap between immature and mature granule neurons (Fig. 3d), which is consistent with prior expectations16. The cell2fate visualization tool complements t-distributed stochastic neighbor embedding or UMAP by providing dynamic insights anchored on estimated differentiation time, and it can also visualize additional metadata such as cell type annotations or developmental age (Fig. 3d, bottom two panels).

The total spliced count estimates can also be used to visualize the dynamics of RNA velocity modules across cells, for example, on a conventional UMAP plot (Fig. 3b). The activation of different modules per cell can be inspected, similar to the factor activity in conventional matrix factorization. We also compared these module activation estimates to conventional factor analysis and clustering methods. Briefly, multi-omics factor analysis (MOFA)23 yielded factors that captured complementary sources of variation, with activity profiles that were temporally more diffuse across the differentiation trajectory. Specifically, these factors did not stratify late granule neuron maturation (Fig. 3b and Supplementary Fig. 17). We also observed overall low correlation between cell2fate module and MOFA factor gene loadings, particularly for the late neuronal maturation modules 4–9 (Supplementary Fig. 18). Similarly, Leiden clustering of the scRNA-seq dataset at different resolutions identified clusters that were not aligned with neuronal maturation (Fig. 3b and Supplementary Fig. 19). Collectively, these observations indicate that cell2fate captures complementary aspects of variation compared to existing decomposition methods and is well suited to conduct granular dissection of granule neuron maturation.

Beyond the activity of modules, the dynamics can be further classified into states within each cell, based on whether they are increasing or decreasing in expression (Fig. 3c and Methods). Both quantities are shown in Fig. 3e,f for granule neuron differentiation. The additional dynamic information in this visualization shows, for example, that module 9 has not reached a steady state, implying that granule neuron maturation continues beyond the time range captured in this dataset.

Finally, we examined to what extent RNA velocity modules can provide deeper insights into the late stages of granule neuron differentiation. We ranked genes by how much of their transcription rate is explained by each module and used top genes as ‘module markers’ (Fig. 3g and Supplementary Fig. 20). We identified Rbm24, Tafa1 (Fam19a1) and Sptb (module 4) and Pakap (Palm2), Pdzd2 and Usp19 (module 5) as markers switched on in immature granule neurons (Fig. 3g and Supplementary Fig. 20). By contrast, Fst, Rapgef5 and Moxd1 (module 6), Prr16, Kdm5d and Rpa3 (module 7) and Rgs2, Nudt13 and 1700048O20Rik (module 8) provide markers of late granule neuron maturation stages (Fig. 3g). Fam19a1 has been reported to suppress neural stem cell maintenance and promote differentiation24,25, consistent with its expression pattern in maturing granule neurons. Rgs2 is dynamically expressed during neuronal activity26 and involved in synaptic plasticity27, consistent with its late induction in module 8. Apart from these two genes, the marker genes reported here have not been functionally investigated in granule neurons or brain development to our knowledge. We further intersected the top 30 markers of each module with a set of 30 Alzheimer’s risk genes28 and 250 autism spectrum disorder risk genes29,30, which highlighted late granule neuron maturation module 8, where the Alzheimer’s associated gene Trappc6a and the autism-associated genes Kdm6a and Nr4a2 were among the markers (Supplementary Table 9). These results highlight that cell2fate can lead to new biological insights, even in a widely characterized cell lineage like granule neurons.

Additionally, we can extract top module marker genes that are TFs as ‘module TFs’ (Fig. 3h). Moving on to such module TFs, Zmat4 (module 6) and Tfam (module 8) are enriched in late granule neuron maturation stages (Fig. 3h). Zmat4 has been reported as upregulated in the auditory cortex of young mice at postnatal day 7 compared to adults31, while Tfam knockout results in immature neuronal phenotypes32. Yet their roles in granule neuron differentiation have not been studied to date. We also find that genes with putative promoter sequences that are most likely to be bound by the top 20 module TFs, as predicted by the ProBound algorithm33, are more frequently among the top 300 module genes than those least likely to be bound by those TFs (Supplementary Fig. 21). These TFs provide putative candidate regulators of late granule neuron differentiation.

In sum, our results demonstrate the great interpretability and statistical power of cell2fate’s module decomposition for scRNA-seq datasets to finely dissect cellular processes and suggest that late granule neuron maturation is composed of distinct stages.

Spatial mapping of RNA velocity modules

Temporal biological processes are often spatially organized in tissues. For example, cell differentiation and migration are often coupled, with cells associating with distinct spatial signaling microenvironments throughout their differentiation trajectories. Here, we sought to link the temporal information captured by cell2fate to spatial tissue organization by mapping RNA velocity modules in a newly generated spatial transcriptomics dataset of human brain development (Fig. 4a).

Fig. 4: Cell2fate interfaces with cell2location to spatially map the cortical neurogenesis process in human brain development.
figure 4

a, Cell2fate steady-state expression of modules is supplied as reference profiles for cell2location to infer abundance of modules across spatial locations. b, Left, illustration of experimental setup. Middle, radial glia cells (green) and intermediate progenitors (violet) first give rise to deep layer neurons (blue) and then switch to produce ULn (yellow). Right, progenitors (green, violet) are located deeper inside the cortex, where they produce newborn neurons (yellow), which differentiate and migrate toward the outside. c, Cell2fate velocity graph UMAP embedding for human brain data. d, Spliced counts produced by each cell2fate module over inferred time. e, A summary spatial plot of three module locations, named by the cell type in which they reach their steady-state expression. f, Module state, markers and locations for individual selected modules. DL, deep layer; UL, upper layer.

We focused on the fetal human cerebral cortex in which excitatory neuron maturation follows a highly stereotyped trajectory through space and time34. Neural progenitors termed radial glia and intermediate progenitors reside in cortical germinal zones, where they sequentially give rise to distinct neuronal subtypes that subsequently migrate out to the deep and upper layers of the cortical plate across their maturation (Fig. 4b). Deep layer-residing neurons (DLn) are born before upper-layer neurons (ULn) in early gestation; hence DLn are relatively more mature than ULn by midgestation (Fig. 4b). Thus, the maturation state and spatial location of cortical excitatory neurons are tightly linked.

To examine cellular differentiation trajectories in the human cortex, we initially performed single-nucleus RNA-sequencing (snRNA-seq) profiling (10x version 3.0) of one donor at midgestation. We then followed standard snRNA-seq processing workflows (Methods) to cluster cells and annotated cell types using markers from the literature35. We annotated distinct neural progenitors (radial glial and intermediate progenitor cells) as well as excitatory neuron populations at different stages of maturation (Fig. 4c). As expected, mature neurons expressed DLn markers, whereas newborn and immature neurons showed enriched expression of ULn markers (Supplementary Fig. 22). We also annotated inhibitory neurons and glial cell types but excluded them from the subsequent excitatory neuron trajectory analysis.

We then applied cell2fate to this human brain snRNA-seq dataset and observed the expected excitatory neuronal differentiation trajectory from neural progenitors to newborn, immature and mature neurons (Fig. 4c). The RNA velocity modules dissected the neuronal trajectory into finer-grained maturation stages, identifying seven sequentially activated and temporally overlapping modules throughout immature and mature neurons (Fig. 4d and Supplementary Fig. 23). While these modules contained some DLn and ULn cell type markers, they also included many genes that are widely expressed across all excitatory neurons in the adult human cortex36, such as PSMC3, KRR1 and BMPER (Supplementary Figs. 24 and 25). This suggests that the modules partially identify a neuronal maturation trajectory common to both DLn and ULn.

In contrast to cell2fate, other RNA velocity methods such as scVelo were not able to accurately identify velocity flow in mature neurons (Supplementary Fig. 26). Additionally, the integrated measurement model of cell2fate allowed us to factor in different detection probabilities for spliced and unspliced counts and correct batch effects in our human brain snRNA-seq dataset (Supplementary Fig. 27), which is crucial for estimating true transcriptional dynamics from observed counts (Methods).

To spatially map our RNA velocity modules in the developing human cortex, we performed Visium spatial RNA-seq profiling (10x CytAssist) of one cortical tissue section from an age-matched donor (Fig. 4b). As the Visium assay offers coarse spatial resolution and profiles multiple cells at each tissue location (that is, Visium spot), we used the cell2location algorithm37 to deconvolve the abundance of RNA velocity modules across spatial data. We used the steady-state expression counts of each module as reference gene expression signatures and then applied the standard cell2location workflow to infer the abundance of each module signature across Visium spots (Fig. 4a and the Methods).

The RNA velocity modules showed expected patterns of spatial mapping across the human cortex (Fig. 4e–f and Supplementary Fig. 28). Progenitor modules spatially mapped to germinal zones (Fig. 4e), while neuronal modules primarily mapped to the cortical plate (Fig. 4e and Supplementary Fig. 28). The fine spatial locations of neuronal modules were consistent with their maturation state (Fig. 4f). The immature ULn module (0) mapped to the upper cortical layers as well as the subplate–intermediate zone that immature neurons pass through during their migration to the cortical plate38. The early mature ULn module (2) was exclusively mapped to the upper cortical layers. By contrast, the late mature DLn module (5) specifically mapped to deep cortical layers.

In sum, our approach provides a workflow to spatially resolve complex cell trajectories through tissues.




Source link

Add a comment

Leave a Reply

Your email address will not be published. Required fields are marked *

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use