Single-cell nascent RNA sequencing unveils coordinated global transcription

scGRO–seq conceptualization

Capturing nascent RNA with sufficient efficiency from single cells for meaningful analysis was deemed challenging. However, recognizing the potential insights into transcription mechanisms that single-cell nascent RNA sequencing could offer, we set out to develop a single-cell version of the GRO–seq method a decade after its use in cell populations. Our efforts were met with two significant challenges: selectively capturing a small fraction of nascent RNA among various RNA species within a cell and accurately distinguishing nascent RNAs from individual cells.

The primary limitation we encountered was capture efficiency. The quantity of nascent RNA from transcribing RNA polymerases in an individual cell, mainly due to the intermittent nature of transcription with short bursts and long latency periods, is significantly lower than the mRNA copies that accumulate over time. Traditional nascent RNA capture methods yield only a meagre number of nascent RNAs from single cells. Miniaturizing GRO–seq using strategies derived from scRNA-seq was not feasible because nascent RNA lacks the consensus polyadenylation sequence used in RNA-seq. Instead, GRO–seq and related methods selectively label nascent RNA in bulk cells using modified nucleotides and use single-stranded RNA–RNA ligation with PCR handles on both ends. This ligation process proved unsuitable for scGRO–seq owing to its low efficiency and the need for nascent RNA purification before ligation, which risks depleting the already scarce nascent RNA from single cells.

To overcome these challenges, we devised a strategy that involved labelling nascent RNA in cells and attaching single-cell barcodes to the labelled nascent RNA without requiring purification from other cellular RNA. After exploring several approaches without success, we turned to click chemistry, specifically CuAAC. We speculated that by sourcing or synthesizing CuAAC-compatible chain-terminating nucleotide triphosphate analogues and performing nuclear run-on with the modified nucleotides to selectively label nascent RNA, we could label nascent RNA from individual cells with 5′-AzScBc DNA with a PCR handle. Then, we could pool the barcoded nascent RNA from multiple cells for selective RT in the presence of a TSO and subsequent PCR amplification for sequencing.

To successfully implement this strategy, we identified three important biochemical hurdles to address. First, we needed to demonstrate the ability of native RNA polymerase to incorporate 3′-(O-propargyl)-NTPs during nuclear run-on reactions. Second, preserving the intactness of nuclei during the run-on reaction was essential to enable the separation of individual nuclei for single-cell barcoding. Finally, we had to confirm the ability of reverse transcriptase to traverse the triazole ring junction formed during CuAAC. Successful resolution of the first and third hurdles would pave the way for CuAAC-based nascent RNA sequencing in cell populations, whereas overcoming the second hurdle would establish the foundation for scGRO–seq.

Development of AGTuC

To develop a nascent RNA tagging method suitable for capturing a small fraction of RNA from single cells, we initiated our approach by focusing on a cell-population-based strategy. We aimed to develop an enhanced nascent RNA tagging method that optimally integrates selective labelling and single-cell barcode tagging, bypassing the need for RNA purification. Among the tested methods, we identified click chemistry as the most suitable option because of its high selectivity, efficiency, robustness in diverse experimental conditions, cost-effectiveness and speed. Our goal was to selectively label nascent RNA through a nuclear run-on reaction, conjugate a single-stranded DNA PCR handle (that can accommodate a single-cell barcode for future use in single-cell analysis), reverse transcribe the RNA–DNA conjugate and prepare a NGS library.

To achieve single-nucleotide resolution of transcribing polymerases and efficient RT, we identified two click-chemistry-compatible, chain-terminating nucleotides with a relatively small functional group: 3′-(O-propargyl)-ATP and 3′-azido-3′-dATP (Extended Data Fig. 1a). Nascent RNA labelled with 3′-(O-propargyl)-NTPs forms a 1,4-disubstituted 1,2,3-triazole junction with azide-labelled DNA through CuAAC, as shown in Click-Code-Seq⁵⁰, whereas nascent RNA labelled with 3′-azido-3′-dNTPs forms a slightly bulkier junction with dibenzocyclooctyne labelled DNA through strain-promoted alkyne-azide cycloadditions (Extended Data Fig. 1b). Nuclear run-on with 3′-(O-propargyl)-ATP and CuAAC showed superior efficiency compared with 3′-azido-3′-dATP and strain-promoted alkyne-azide cycloadditions (Extended Data Fig. 1c).

To convert the clicked RNA–DNA conjugate to cDNA, we tested eight different reverse transcriptase enzymes, varied the temperature and duration of RT and evaluated three TSOs (Extended Data Fig. 1d–f, some results not shown). Our optimized method, which we AGTuC, was then performed in 5 million mouse ES cell nuclei. AGTuC nascent RNA profiles closely resembled PRO–seq profiles (Extended Data Fig. 2a) and exhibited strong correlations at both gene and enhancer levels (Extended Data Fig. 2b,c). Notably, the AGTuC library protocol involved significantly fewer steps than PRO–seq and could be completed in a single day (Extended Data Fig. 2d). AGTuC is a simpler, faster and cheaper alternative to GRO–seq and PRO–seq for nascent RNA sequencing from cell populations.

Development of inAGTuC

To adapt CuAAC-mediated nascent RNA sequencing to single cells, we explored the feasibility of performing AGTuC in single cells. Implementing AGTuC at the single-cell level presented challenges as the nuclear run-on reaction with 0.5% sarkosyl disrupts the nuclear membrane before cell barcodes could be attached during the post-run-on CuAAC step, which leads to unintended mixing of nascent RNA from different cells. One potential solution was to perform AGTuC in single tubes, which would prevent nascent RNA mixing. However, this approach requires RNA purification after the run-on reaction, but purification results in further depletion of exceedingly low amounts of nascent RNA in single cells. Alternatively, omitting RNA purification would lead to an abundance of 3′-(O-propargyl)-NTPs supplied in excess during the run-on reaction, which could outcompete 5′-AzScBc DNA during CuAAC.

To address this challenge, we developed inAGTuC, a new strategy that enables labelling nascent RNA with 3′-(O-propargyl)-NTPs while preserving nuclear integrity. This approach overcomes the issues associated with nascent RNA mixing before single-cell barcoding. We proposed that performing the run-on reaction without disrupting the nuclear membrane would facilitate the easy removal of excess nucleotides through a few centrifugation and resuspension steps while retaining propargyl-labelled nascent RNA within the nuclei. This approach would produce clean nuclei with labelled nascent RNA, free from excess reactive nucleotides, which could be compartmentalized with 5′-AzScBc DNA for CuAAC. We could minimize further RNA loss by pooling and processing the single-cell-barcoded nascent RNA from multiple cells.

To achieve an efficient run-on reaction, PRO–seq and AGTuC disrupt the polymerase complex with 0.5% sarkosyl detergent, of which nuclear membrane lysis is collateral damage. We sought to identify the lowest sarkosyl concentration that maintains nuclear membrane integrity while maximizing run-on efficiency and found that a 20× reduction in sarkosyl concentration preserved nuclear intactness, with only a 20% reduction in run-on efficiency (Extended Data Fig. 3a,b). To maximize the capture efficiency of nascent RNA, we optimized the molecular crowding effect of PEG 8000 and the ratio of Cu(I) to the CuAAC accelerating ligand BTTAA (Extended Data Fig. 3c). Although a low sarkosyl concentration preserves nuclear integrity, it also retains the RNA polymerase complex intact, thereby shielding the propargyl-labelled 3′ end of nascent RNA from reacting with 5′-AzScBc DNA. We investigated nascent RNA release from the RNA polymerase complex using common denaturants and found that 6 M urea and TRIzol was efficient (Extended Data Fig. 3d). However, the denaturant in TRIzol hindered CuAAC reaction (Extended Data Fig. 3e). Notably, urea also offered the added benefit of retaining the RNA–DNA conjugate in the aqueous phase during TRIzol clean-up to remove PEG 8000 from the CuAAC reaction (Extended Data Fig. 3f). For reaction clean-up, we assessed various methods, finding cellulose membrane to be effective in removing CuAAC reagents (Extended Data Fig. 3g), whereas silica matrix columns performed well in retaining RNA and ssDNA (Extended Data Fig. 3h). Subsequently, we evaluated DNA polymerase for library preparation and DNA size-selection methods (Extended Data Fig. 3i,j).

Considering the goal of working with single cells, we performed inAGTuC with cell numbers between 5 million used in AGTuC and 1 cell planned for scGRO–seq. Specifically, we placed 100 to 1,000 intact nuclei in each well of a 96-well plate containing urea. Nascent RNA in each well was barcoded with a unique 5′-AzScBc DNA by CuAAC and pooled from the 96 wells, and a sequencing library was prepared as in AGTuC. The inAGTuC libraries exhibited similar profiles in gene bodies compared with PRO–seq and AGTuC. However, they could not capture the paused peaks at the 5′ end of genes and enhancers (Extended Data Fig. 4a–c). This observation is consistent with the need for a higher sarkosyl concentration for efficient run-on of paused polymerase complexes. The four inAGTuC libraries correlated well with each other (Extended Data Fig. 4d), with the potential to discover more insights with deeper sequencing (Extended Data Fig. 4e,f). Despite only partially capturing nascent RNA from a paused complex, the inAGTuC libraries correlated well with those from AGTuC and PRO–seq (Extended Data Fig. 4g).

To systematically characterize the compatibility of inAGTuC with even fewer cells, we prepared four inAGTuC libraries in a 96-well plate, with 12 c.p.w., 120 c.p.w. and 1,200 c.p.w., which is roughly equivalent to 1,000, 10,000 and 100,000 nuclei, respectively. We also included a 1,200 c.p.w. plate, omitting Cu(I) as a negative control. Despite lower coverage, the inAGTuC library with 12 c.p.w. (total of about 1,000 cells) successfully captured the overall nascent RNA profile. It exhibited a good correlation with 120 c.p.w. (total of about 10,000 cells) and 1,200 c.p.w. (total of around 100,000 cells) (Extended Data Fig. 5a–c).

3′-(O-propargyl)-nucleotide synthesis

For this study, several CuAAC-compatible nucleotide analogues modified with azide or alkyne functionalities were evaluated. Ultimately, 3′-(O-propargyl)-NTPs were selected for three main reasons: (1) these analogues lack 3′ hydroxyl groups, making them chain-terminating and enabling single-nucleotide resolution of the 3′ end of nascent RNA; (2) the CuAAC reaction produces a compact junction due to the presence of a single carbon bond between the sugar group of the nucleotide and the propargyl group at the 3′ end position; and (3) they are relatively cost-effective compared with biotin-modified nucleotides commonly used in PRO–seq.

3′-(O-Propargyl)-ATP (NU-945) was offered by Jena Biosciences. To complete the set, custom synthesis requests were made for 3′-(O-propargyl)-CTP (NU-947), 3′-(O-propargyl)-GTP (NU-946) and 3′-(O-propargyl)-UTP (NU-948), all of which are now available for purchase from Jena Biosciences.

Single-cell barcoded DNA adaptors

During scGRO–seq development, 3 sets of 96 5′-AzScBc DNA were synthesized by GeneLink. Each design encompassed four components: a 5′ azide positioned at the 5′ terminus, a 10–12 nucleotide sequence for the single-cell barcode, a 4–6 nucleotide sequence for the UMI and a PCR handle. The 5′ azide modification was obtained following a previously described method⁵¹. Specifically, an oligonucleotide containing 5′ iodo-dT was synthesized through solid-support phosphoramidite oligonucleotide synthesis, and subsequent replacement of the iodo group with an azide group was achieved through a reaction with sodium azide at 60 °C for 1 h. The sequences of three different 5′-AzScBc DNA are available in Supplementary Table 7.

The hairpin structure of the 86-nucleotide 5′-AzScBc DNA (Supplementary Fig. 3a) is formed through self-folding. The RT process is initiated using the 3′ end of the oligonucleotide, which serves as a built-in primer. This design ensures a 1:1 stoichiometry between the PCR handle and the RT primer, minimizing mispriming and nonspecific amplification during RT. The folded hairpin structure also generates a restriction site for the EagI enzyme, which is digested before PCR amplification.

Undesired extension by reverse transcriptase is effectively prevented by a three-carbon spacer at the 3′ end of the 43-nucleotide 5′-AzScBc DNA⁵². This version of the azide adaptor harbours a 5-nucleotide ACAGG sequence after the azide-dT at its 5′ end (Supplementary Fig. 3b). During RT, the extension of primers annealing to unclicked 5′-AzScBc, the addition of non-templated CCC and the incorporation of TSO results in undesired cDNA that are preferred substrates for PCR amplification. If unaddressed, these amplicons can overwhelm the sequencing library. The ACAGG sequence plays a crucial role in depleting these PCR amplicons.

A previously described method named DASH uses recombinant Cas9 protein and gRNA complex to digest and deplete undesired dsDNA⁵³. The ACAGG sequence is necessary to generate a gRNA target sequence in the undesired PCR amplicons (underlined sequence). In PCR amplicons formed between nascent RNA and 5′-AzScBc DNA, the complementation of gRNA is interrupted by the presence of a nascent RNA sequence, which makes the desired products incompatible with DASH. AGG serves as the protospacer adjacent motif.

Cell line

The V6.5 mouse ES cells used in this study were established by the Jaenisch Laboratory (Whitehead Institute, Massachusetts Institute of Technology) from the inner cell mass of a 3.5-day-old mouse embryo from a C57BL/6(F) × 129/sv(M) cross.

Cell culture

Mouse ES cells were cultured in Dulbecco’s modified Eagle medium (Gibco, 11995), plus 10% fetal bovine serum (HyClone, SH30070.03), supplemented with 1× penicillin–streptomycin (Gibco, 15140), 1× non-essential amino acids (Gibco, 1140), 1× l-glutamine (Gibco, 25030), 1× β-mercaptoethanol (Sigma, M6250) and 1,000 U ml^–1 leukaemia inhibitory factor (Sigma, ESG1107) on tissue-culture-treated 10 cm plates (Corning, CLS430167) pre-coated with 0.2% gelatin (Sigma, G1890) prepared in PBS (Fisher, MT21031CV). Cells were grown at 37 °C and 5% CO₂ and passed with HEPES buffered saline solution (Lonza, CC-5024) and 0.25% trypsin-EDTA (Gibco, 25200) when 70% confluency was reached (every 2 days).

Sample preparation

Tissue culture cells were prepared for nuclear run-on reaction by either nuclei isolation or cell permeabilization as described below. All centrifugation steps were performed at 1,000g for 5 min. Cells were collected by removing the tissue culture medium, rinsing with PBS and placing the plates on ice. Cells were scraped while still on ice. The cells were collected into a 15 ml conical tube and centrifuged at 1,000g for 5 min.

For nuclei isolation, the pellet was resuspended in ice-cold douncing buffer (10 mM Tris-Cl pH 7.4, 300 mM sucrose, 3 mM CaCl₂, 2 mM MgCl₂, 0.1% Triton X-100, 0.5 mM DTT, 0.1× Halt protease inhibitor and 0.02 U µl^–1 RNase inhibitor) and transferred to a 7 ml dounce homogenizer (Wheaton, 357542). After incubation on ice for 5 min, the cells were dounced 25 times with a tight pestle, transferred back to the 15 ml conical tube and centrifuged to pellet the nuclei. The pellet was washed twice in a douncing buffer.

For cell permeabilization, the pellet was resuspended in ice-cold permeabilization buffer (10 mM Tris-Cl pH 7.4, 300 mM sucrose, 10 mM KCl, 5 mM MgCl₂, 1 mM EGTA, 0.05% Tween-20, 0.1% NP-40, 0.5 mM DTT, 0.1× Halt protease inhibitor and 0.02 U µl^–1 RNase inhibitor). After incubation on ice for 5 min, the cells were centrifuged to pellet the nuclei. The pellet was washed twice in the permeabilization buffer.

The washed pellet was resuspended in storage buffer (10 mM Tris-Cl pH 8.0, 5% glycerol, 5 mM MgCl₂, 0.1 mM EDTA, 5 mM DTT, 1× Halt protease inhibitor and 0.2 U µl^–1 RNase inhibitor) at a concentration of 5 × 10⁶ nuclei per 50 µl of storage buffer, flash-frozen in liquid nitrogen and stored at −80 °C. The nuclei and permeabilized cells in the storage buffer can be stored for up to 5 years at −80 °C, making them readily available for nuclear run-on experiments.

Nuclear run-on with 3′-(O-propargyl)-nucleotides

A volume of 50 µl of 2× nuclear run-on buffer (20 mM Tris-Cl pH 8.0, 10 mM MgCl₂, 400 mM KCl, 50 µM 3′-(O-propargyl)-ATP, 50 µM 3′-(O-propargyl)-CTP, 50 µM 3′-(O-propargyl)-GTP, 50 µM 3′-(O-propargyl)-UTP, 0.05% Sarkosyl, 1 mM DTT, 2× Halt protease inhibitor and 0.4 U µl^–1 RNase inhibitor) was prepared per sample and heated to 37 °C. Once thawed from −80 °C, permeabilized cells or nuclei were added to the heated tube containing nuclear run-on buffer and incubated for 5 min at 37 °C with gentle tapping at the incubation midpoint. Permeabilized cells or nuclei were centrifuged at 500g for 2 min at 4 °C, and the supernatant was aspirated off. The pellet was washed 3 times in 150 µl resuspension buffer (5 mM Tris-Cl pH 8.0, 2.5% glycerol, 2.5 mM MgAc₂, 0.05 mM EDTA, 1.25 mM MgCl₂, 60 mM KCl, 3 mM DTT, 0.2× Halt protease inhibitor and 0.2 U µl^–1 RNase inhibitor). After the final wash, the permeabilized cells or nuclei were resuspended in a 2 ml resuspension buffer and passed through a 35 µm nylon mesh (Falcon, 352235).

Single-cell sorting and nuclei sorting

For single-cell and nuclei sorting, 96-well plates with 2.5 µl 8 M urea were prepared using a multichannel or 96-well pipettor (Avidien MicroPro 300, 30835029). Single cell and nuclei populations characterized by forward and side scattering were sorted by FACS into the 96-well plate containing urea. The sorted plates can be used in CuAAC directly or sealed with aluminium foil or a plastic seal and stored at −80 °C.

CuAAC

A 96-well plate containing 5′-AzScBc DNA with a unique cell barcode in each well previously synthesized and aliquoted was thawed from −80 °C. Sodium ascorbate, PEG 8000, CuSO₄ and accelerating ligand BTTAA were prepared and dispensed into each well of the 96-well plate containing 5′-AzScBc DNA. The CuAAC reaction mix was dispensed into individual wells containing single cells in urea using a multichannel or 96-well pipette. The final concentration of CuAAC reaction in each well was 30 nM 5′-AzScBc DNA, 800 mM sodium ascorbate, 15% PEG 8000, 1 mM CuSO₄, 5 mM BTTAA and 2.66 M urea in a 7.5 µl volume. The 96-well plates were sealed, vortexed for 10 s in an orbital vortexer and centrifuged for 1 min at 500g before incubation for 2 h at 50 °C.

After incubation, the CuAAC reaction was quenched with 5 mM EDTA and pooled from 96 wells into a 1.5 ml Eppendorf tube. PEG 8000 was removed using TRIzol. The remaining CuAAC reagents (sodium ascorbate, CuSO₄ and BTTAA) were removed with a centrifugal filter with 3 kDa cellulose membrane (Amicon, 2020-04). The purified RNA was fragmented with 10 mM ZnCl₂ for 5 min at 65 °C.

RT through the triazole link and pre-amplification

RT of the clicked RNA–DNA conjugate was performed with highly processive Moloney murine leukaemia virus (M-MuLV) reverse transcriptase lacking RNase H activity but capable of RNA-dependent and DNA-dependent polymerase activity, non-templated addition and template switching (Thermo Fisher, EP0751). RT reaction (1× RT buffer, 0.5 mM dNTPs, 0.8 U µl^–1 RNase inhibitor, 16% PEG 8000, 1 µM RT primer (except for hairpin-forming 5′-AzScBc DNA), and 1 µm TSO) was incubated with the RNA–DNA conjugate for 2 h at 50 °C. The cDNA was size-selected in 10% denaturing PAGE away from the unclicked 5′-AzScBc DNA and empty cDNA formed between the 5′-AzScBc DNA and TSO.

The purified cDNA was PCR amplified for 6 cycles to generate dsDNA with NEBNext Ultra II Q5 High-Fidelity 2× master mix (NEB, M0544) and 0.5 µM PCR primers with unique dual index using the PCR cycles presented in Supplementary Table 8.

Removal of empty adaptors using DASH

The dsDNA from the pre-amplification of cDNA was subjected to DASH to remove the undesired amplicons formed by RT of unclicked 5′-AzScBc DNA and TSO, as described above. Cas9–gRNA complex (6.6 µM Streptococcus pyogenes Cas9 nuclease (NEB, M0386T), 20 µM gRNA, 1× NEBuffer r3.1 and nuclease-free duplex buffer (IDT, 11-05-01-04)) was prepared by incubation for 15 min at 25 °C. The incubated complex was added to the cleaned PCR reaction and incubated for 1 h at 37 °C.

PCR amplification and NGS

The DASHed library was PCR amplified with NEBNext Ultra II Q5 High-Fidelity 2× master mix (NEB, M0544) and 0.5 µM PCR primers with a unique dual index using the two-step PCR cycles presented in Supplementary Table 9.

The NGS library was sequenced on Illumina NovaSeq SP100 flow cells with 64 nucleotides forward read, 43 nucleotides reverse read, 8 nucleotides index 1 and 8 nucleotides index 2.

Alignment and pre-processing

Adaptor sequences were removed from paired-end fastq files using Cutadapt⁵⁴. In brief, the read 1 sequence CCCCTGTCTCTTATACACAT and the read 2 sequence AGATCGGAAGAGCGTCGTGT were trimmed with a maximum error rate of 0.15, requiring a minimum overlap of 12 nucleotides between the read and adapter. The resulting adapter-trimmed reads were demultiplexed using Flexbar⁵⁵. Cell barcodes and UMIs were extracted from the 5′ end of read 1, applying a barcode error rate of 0.15 and retaining reads of at least 14 nucleotides in length. The adapter-clipped and demultiplexed reads were first mapped to the mouse ribosomal genome using bowtie2 (ref. ⁵⁶) in –very-sensitive mode. The reads unmapped to the ribosomal genome were mapped to the mouse genome (mm10 build) in –very-sensitive mode. After mapping, duplicate reads were identified and removed utilizing UMI and mapping coordinates with UMI-tools⁵⁷.

Filtering experimental batches and cells

The scGRO–seq batches with r² values of at least 0.6 against at least 60% of all batches were selected for further analysis. Cells were required to contain a minimum of 1,000 UMIs and 750 features for further analysis. Our study involved 17 batches of scGRO–seq experiments across 39 96-well plates, encompassing a total of 3,744 cells. Of these, 36 plates (each containing a minimum of 24 high-quality cells) and 2,635 cells met the threshold.

Estimation of capture efficiency

The average capture efficiency of scGRO–seq was estimated to be approximately 10%. We used data from the intron seqFISH study¹⁷, which quantified the abundance of 34 introns by single-molecule fluorescent in-situ hybridization (smFISH). Based on the slope of the line of best fit between data from smFISH and intron seqFISH, the detection efficiency of intron seqFISH was estimated to be 44%. When scGRO–seq was compared with intron seqFISH, the detection efficiency of scGRO–seq was 26% of intron seqFISH. Based on these two detection efficiencies, the estimated capture efficiency of scGRO–seq is about 10% (26% of 44% is approximately 10%). This estimate is based on the 8 min of median time required for intron to be spliced out once it is transcribed, which ranges from 5 to 10 min according to several studies using diverse methods^{58,59,60,61,62,63,64}. Thus, the capture efficiency of 10% is an average approximation and can vary among cells and batches.

Enhancer annotation

Active transcription regulatory elements (TREs) in mouse ES cells were identified with PRO–seq data using dREG⁶⁵. Further filtering of the dREG results, carried out to eliminate TREs within or proximal to 1,500 bp of the RefSeq annotated genes (n = 23,980), identified 68,299 high-confidence TREs. The remaining TREs within 500 bp of each other were combined, which resulted in the final list of 12,542 enhancers. To capture nascent RNA derived from elongating RNA polymerases at these enhancers, the TREs were extended at least 1500 bp from the TSS in both directions. The overlapping enhancers were stitched together after extension.

Transcription unit calling

groHMM (https://www.bioconductor.org/packages/release/bioc/vignettes/groHMM/inst/doc/groHMM.pdf) was used to call de novo transcription unit on PRO–seq data. All combinations of tuning parameters (−50, −100, −200, and −400 for LP and 5, 10, and 15 for UTS) were tested. LP represents the ‘log-transformed transition probability of switching from transcribed state to non-transcribed state’, and UTS represents ‘the variance of the emission probability for reads in the non-transcribed state’. In our test, −50 LP and 10 UTS performed best for optimal transcription unit calling.

Evidence of bursting

Transcriptional bursting was examined de novo using scGRO–seq data by measuring two parameters: the multiplicity of RNA polymerases and the distance between the RNA polymerases. The bursting model suggests that transcription occurs in short bursts punctuated by long silent periods, which results in on and off states. The alternative model is the relatively uniform transcription initiation by primarily solitary RNA polymerase. We expected two observations under the bursting model.

First, we expected a higher incidence of more than one RNA polymerase per burst and a concurrent depletion of single RNA polymerases. To test the evidence of bursting, we selected genes longer than 11 kb (n = 13,564) and trimmed 0.5 kb regions from the 5′ and 3′ ends of the gene that are known to harbour paused polymerases. With an average transcription rate of 2.5 kb min^–1, the remaining 10 kb region resulted in an observation window of 4 min. Based on the evidence of monoallelic transcription described in the main text and a short observation window of 4 min, we assigned all signals for a gene in individual cells to one allele. We quantified the observed incidence of zero, one (singlets) and more than one RNA polymerase (multiplets) per allele. The majority of alleles had zero polymerase. To calculate the expected incidences of RNA polymerases under the non-bursting model, we permuted the cell identity of scGRO–seq reads 200 times without changing the read positions. The permutation maintains the number of UMIs per cell, breaks the bursting-mediated association between RNA polymerases, and mimics the RNA polymerases distribution under the non-bursting model. We quantified the permuted incidences of zero, singlets and multiplets.

Second, if more than one RNA polymerase is observed in the burst window, either due to transcriptional bursting or random chance, we expected the transcription bursting model would result in more closely spaced molecules than expected by the random chance. We took all multiplets in observed or permuted data and calculated the distance between RNA polymerase molecules within each pair. We binned the distances in 50 bp bins and calculated the ratio of RNA polymerase pairs between the observed and permuted data.

Burst kinetics

Genes over 11 kb (n = 13,564) were selected for studying transcriptional bursting kinetics, and 500 nucleotide regions at both ends known to harbour paused polymerases were truncated. In cases in which genes exceeded 10 kb after trimming, they were shortened to 10 kb starting from the initiation site of the gene. With an average transcription rate of 2.5 kb min^–1, this 10 kb burst window served an average burst duration of 4 min. The calculation of burst size and burst frequency proceeded as described below.

Burst size

For each gene, the number of cells with at least one read within the 10 kb burst window (number of bursts) was identified, and then the average UMIs per burst was computed. If a consistent single read per burst was observed, the burst size of that gene was set to 1. However, if the average burst size was 1.2, the residual burst above 1 indicated a higher burst size. Accounting for the 10% capture efficiency, wherein the likelihood of capturing paired reads within a burst window is 1%, the residual burst was proportionally adjusted by the capture efficiency. The equation for the burst size is shown in Supplementary Fig. 4 (top).

Burst frequency

For each gene, the burst frequency was determined as the number of bursts per allele (two alleles in autosomal and one in sex chromosomes) per transcription time. The transcription time was calculated as the duration needed to traverse the 10 kb burst window with a uniform transcription rate of 2.5 kb min^–1, translating to 4 min. The calculated burst frequency was normalized by the capture efficiency, taking the burst size into account. Although burst events with a larger burst size, like ten, would be consistently detected even with 10% capture efficiency, normalization was applied for cases in which a burst size like four would result in a 60% false negative rate, which indicated a non-existent burst despite active bursting. Thus, burst frequency normalization was scaled by burst size to ensure accurate quantification. The equation for the burst frequency is shown in Supplementary Fig. 4 (bottom).

Genes with core promoter elements like TATA and Initiator sequences were retrieved from the Eukaryotic Promoter Database (http://epd.vital-it.ch)⁶⁶. Genes containing a pause button, a sequence associated with promoter–proximal paused RNA polymerase, were recovered from the CoPRO dataset⁶⁷.

Simulation of idealized burst kinetics

We simulated read counts for populations of single cells to evaluate the performance of our estimators for burst rate and size. In the first simulation, we randomly generated the true burst size (T_size) for all human genes from a normal distribution (mean = 2, standard deviation = 3). Similarly, we generated true burst rates (T_rate) for all human genes from a normal distribution (mean = 1, standard deviation = 1). T_size less than 1 was corrected to 1, and T_rate less than 0.1 burst per hour was corrected to 0.1. These parameters were used to simulate UMIs per gene per cell as follows:

1.

For each cell and each gene, a sample from a Poisson distribution with rate parameter λ = T_rate.
2.

Scale the sampled burst by T_size and round to the nearest integer.
3.

After generating molecule counts for all genes and all cells, randomly subsample to a specified level (for example, 10% sampling efficiency) without replacement.

In the second simulation, T_size and T_rate were taken from our genome-wide estimates described in Fig. 2, and UMIs per gene per cell were similarly generated. Simulations were performed ten times to ensure consistent results.

Cell cycle analysis

Three sets of transcriptionally characterized genes were used to characterize the cell cycle phase in individual cells. Transcription of 68 replication-dependent histone genes on chromosome 3, chromosome 6, chromosome 11 and chromosome 13 were used to determine the S phase collectively. Transcription of four genes (Orc1, Ccne1, Ccne2 and Mcm6) were used to assign G1/S phase, and six genes (Wee1, Cdk1, Ccnf, Nusap1, Aurka and Ccna2) were used to assign G2/M phase. Cells with more than a read in one of the genes or reads in more than one gene were hierarchically clustered, which revealed three major clusters of the cell-cycle-phase-specific transcription pattern. The other three smaller clusters without distinct transcription patterns were not considered for downstream analyses. Differentially expressed genes among G1/S, S and G2/M phases of the cell cycle were identified using the ‘FindAllFeatures’ function of Seurat⁶⁸ (single-cell analysis package).

Gene–gene co-transcription

The co-transcription of genes was determined using two criteria: correlation and permutation. scGRO–seq reads were collected from up to the first 10 kb of genes after 500 bp regions at both ends were trimmed (n = 15,666). The genes by cells expression matrix was binarized. For the correlation approach, pairwise correlation was performed for all gene pairs, and the P value was calculated using the chi-square test. It was adjusted for multiple hypothesis tests using the Benjamini–Hochberg correction method.

Permutation was performed by shuffling the cell identifiers of reads while maintaining their gene assignments. The permutation method accounts for several unknown and known biases and, more importantly, maintains the number of reads in each cell. The observed and permuted co-transcription frequencies of gene pairs were calculated. The empirical P value for a gene pair was determined by counting the incidence of equal or higher co-transcription frequency in 1,000 permutations compared with the observed co-transcription frequency.

Gene pairs with correlation coefficients of greater than 0.1 and multiple hypothesis corrected P values of less than 0.05 from the correlation approach and an empirical P value of less than 0.05 from the permutation approach were considered co-transcribed. A network of pairwise co-transcribed genes was created using the Leiden algorithm, and the modules were selected for gene ontology analyses using the clusterProfiler R package.

Enhancer–gene co-transcription

Enhancer–gene co-transcription was determined following the logic of gene–gene co-transcription, substituting genes on one arm with enhancers. scGRO–seq reads were collected from up to the first 10 kb of genes after 500 bp regions at both ends were trimmed, and from at least a 3 kb region around enhancers (1,500 bp sense and 1,500 bp antisense) after a 500 bp region around the TSS was removed to avoid paused polymerases. Strand-specific reads on either side of the enhancer TSS were combined to determine enhancer expression. The features (genes + enhancers) by cell expression matrix was binarized, and the co-transcribed enhancer–gene pairs were determined using the correlation and permutation tests, similar to the approach used in the gene–gene co-transcription calculation. The UMIs per cell are maintained in each permutation. Enhancer–gene pairs only from the same chromosomes were retained for downstream analyses. We also included non-overlapping SEs identified in mouse ES cells.

Enhancers of pluripotency factors

Validated enhancers associated with pluripotency transcription factors OCT4 (also known as POU5F1), SOX2, Nanog and KLF4 were collected from studies referenced in the main text. To define time bins within genes, genes were divided into 5 kb bins (2-min bins calculated using the 2.5 kb min^–1 constant transcription rate of elongating RNA polymerases) in the sense and antisense direction until the end of the transcription wave called by groHMM⁶⁹, or they overlapped bins from other genes. For enhancers, the TSS was first determined based on the strongest OCT4, SOX2 and Nanog chromatin immunoprecipitation and sequencing (ChIP–seq) peaks. The precise position was determined by evaluating the divergent transcription around them. The reads from corresponding bins in sense and antisense directions were combined.

CRISPR-validated SEs

A set of validated SEs and their target genes were used from a previously published study referenced in the main text. SEs in gene introns or associated with miRNA were excluded due to the ambiguity in assigning reads and short gene length, respectively. For the time bin analyses, genes and SEs were divided into four 5 kb bins (2-min with the 2.5 kb min^–1 constant transcription rate of elongating polymerases) in the sense and antisense direction, limiting the analyses to the first 20 kb. Using a 20 kb region in this analysis yields four 5 kb bins. The TSS was first determined based on the strongest OCT4, SOX2 and Nanog ChIP–seq peaks, and precise position was determined by evaluating the divergent transcription around them. The reads from corresponding bins in sense and antisense directions were combined. The scrambled random pairs in SE–gene time bin analysis represent the co-transcribed bins between SEs and genes that are not the verified pairs.

External data

Various data types were analysed, compared and benchmarked against this study. PRO–seq data (GSE169044), ChIP data for p300 (GSM2360934), ATAC–seq (GSE169044), CDK9 (GSM1082347), RNA PolII (GSM318444), H3K4me1 (GSM281695), H3K4me3 (GSM1082344), H3K27Ac (GSM594579), OCT4 (GSM1082340), SOX2 (GSM1082341) and Nanog (GSM1082342) were downloaded from the Gene Expression Omnibus database. PRO–seq libraries were prepared using the same cells used for scGRO–seq under identical conditions⁷⁰. Intron seqFISH data on mouse ES cells were downloaded from table S1 of ref. ¹⁷. The genes-by-cells intron seqFISH matrix was binarized, and burst frequency was calculated assuming the signal in each gene comes from a burst equivalent to the 10 kb region used in scGRO–seq, given the probes were designed against the introns at the 5′ regions of genes. Mouse ES cell scRNA-seq was used from a previous study⁷, and the burst kinetics was downloaded from 41586_2018_836_MOESM5_ESM.xlsx file associated with this study.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link

Latest Posts