20190909_Count3ssRatioExample

Last updated: 2019-09-18

Checks: 6 1

Knit directory: cheRNA_pilot/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190813)

The command set.seed(20190813) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 3d9467f

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Unstaged changes:
    Modified:   analysis/20190909_Count3ssRatioExample.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	3d9467f	Benjmain Fair	2019-09-18	update 3ss classification
html	3d9467f	Benjmain Fair	2019-09-18	update 3ss classification
Rmd	46e6b59	Benjmain Fair	2019-09-16	update site
html	46e6b59	Benjmain Fair	2019-09-16	update site
Rmd	b3b13e6	Benjmain Fair	2019-09-16	update site
html	b3b13e6	Benjmain Fair	2019-09-16	update site
Rmd	4ea43f8	Benjmain Fair	2019-09-16	update 3ss quantification analyses
html	4ea43f8	Benjmain Fair	2019-09-16	update 3ss quantification analyses
Rmd	78712fa	Benjmain Fair	2019-09-16	update site
html	78712fa	Benjmain Fair	2019-09-16	update site

library(tidyverse)

For each 3’ss, I used bedtools to count the number of reads overlapping the upstream 25bp window, and the downstream 25bp window. The ratio of these of these two counts is a measure of splicing efficiency, and the ratio of those ratios in nascent RNA-seq versus standard RNA-seq is a measure of how contranscriptional a splicing event is (see Herzel et al). Here I will look at these metrics using a standard RNA-seq dataset and nascent RNA-seq dataset from the same LCL line 18862.

I have already calculated counts in upstream and downstream windows for each annotated 3’ss using bedtools (see Snakemake code). Here I will investigate the results with a few plots. One result I expect to see is that alernative 3’ss will be spliced less contranscripionally. In the case of cassette exons, it is important to distinguish between the upstream 3’ss and the downstream 3’ss. Upstream 3’ss introns will naturally have less coverage in the downstream window (i.e. in the cassette exon).

First read in Gencode annotated introns and classify the 3’ss as alternative upstream, alternative downstream, or constitutive, based on how many introns are annotated with each 3’ss. This Gencode file of introns was made using gtf2leafcutter.pl script in leafcutter github repo, and it is a redundant file of all intron segments from Gencode gtf. In other words, there are many overlapping introns, including many with the same 3’ss. So here is the pseudo-code logic on how I classify introns:

Constitutive introns:

3’ss is paired to one and only one 5’ss in the annotations. The 5’ss that it is paired to is paired only to that 3’ss and no other 3’ss.

Downstream alternative introns (green):

3’ss is paired to multiple 5’ss in the annotations, and all of the 5’ss that it is paired to are only paired to that 3’ss. This will also include 3’ss that are associated with alt 5’ss usage.

Upstream alternative introns (blue):

3’ss is paired to a 5’ss that is used in more than 1 intron. This will also include 3’ss that are associated with alt 3’ss usage.

GencodeIntrons <- read.table("../data/GencodeHg38_all_introns.corrected.bed.gz", sep='\t', col.names = c('chrom', 'start', 'stop', 'name', 'score', 'strand', 'gene', 'intronNumber', 'transcriptType'))

SpliceSitesCounted <- GencodeIntrons %>%
  filter(transcriptType=="protein_coding") %>%
  distinct(chrom, start, stop, .keep_all = T) %>%
  mutate(Acceptor = case_when(strand == "+" ~ paste(chrom, stop, strand, sep="."),
                              strand == "-" ~ paste(chrom, start, strand, sep="."))) %>%
  mutate(Donor = case_when(strand == "+" ~ paste(chrom, start, strand, sep="."),
                           strand == "-" ~ paste(chrom, stop, strand, sep="."))) %>%
  add_count(Acceptor, name="AcceptorCount") %>%
  add_count(Donor, name="DonorCount")

#AcceptorCount is number of intron segments (unique Donor/Acceptor pairs) with that Acceptor 3'ss
#DonorCount is number of intron segments (unique Donor/Acceptor pairs) with that Donor 5'ss
#DonorMax is for each acceptor, what is the maximum DonorCount amongst all the donors it is paired with


Annotated <- SpliceSitesCounted %>%
  # filter(AcceptorCount==1 & DonorCount==1) %>% dim()
  group_by(Acceptor) %>%
  mutate(DonorMax=max(DonorCount)) %>%
  ungroup() %>%
  distinct(Acceptor, .keep_all = T) %>%
  mutate(Type3ss = case_when(AcceptorCount == 1 & DonorMax==1 ~ "Constitutive intron",
                             AcceptorCount >1 & DonorMax > 1 ~ "Downstream alternative intron",
                             AcceptorCount == 1 & DonorMax > 1 ~ "Upstream alternative intron",
                             AcceptorCount >1 & DonorMax==1 ~ "Unclassifiable"))

#By the above logic, "Unclassifiable"" means a 3ss that is associated with alt 5ss but not exon skipping, since none of the Donors it uses is used more than once.
table(Annotated$Type3ss)


          Constitutive intron Downstream alternative intron 
                       149839                         20115 
               Unclassifiable   Upstream alternative intron 
                         7003                         31735

# Write out as bed file to visually inspect classifications in IGV to check for bugs
# Manual inspection of the resulting bed file indiciates I am classifying introns as intended.
write.table_with_header <- function(x, file, header, ...){
  cat(header, '\n',  file = file)
  write.table(x, file, append = T, ...)
}

Annotated %>%
  mutate(NewName=paste(Acceptor, Type3ss, sep=".")) %>%
  
  # mutate(NewStart = case_when(strand == "+" ~ stop-25,
  #                             strand == "-" ~ start)) %>%
  # mutate(NewStop = case_when(strand == "+" ~ stop,
  #                          strand == "-" ~ start-25)) %>%
  # select(chrom, NewStart, NewStop, paste(Acceptor, Type3ss), ".", strand) %>% head()
  mutate(Color = case_when(Type3ss == "Constitutive intron" ~ "255,0,0",
                   Type3ss == "Downstream alternative intron" ~ "0,204,0",
                   Type3ss == "Upstream alternative intron" ~ "0,0,255",
                   Type3ss == "Unclassifiable" ~ "224,224,224")) %>%
  mutate(thickStart=start, thickEnd=stop, newScore=0) %>%
  select(chrom, start, stop, Acceptor, newScore, strand, thickStart, thickEnd, Color) %>%
  # write.table("~/Temporary/3ssAnnotations.bed", quote=F, sep='\t', col.names=F, row.names = F)
  write.table_with_header("~/Temporary/3ssAnnotations.bed", 'browser hide all\ntrack name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"', quote=F, sep='\t', col.names=F, row.names = F)

Now, similarly to Herzel et al, set a minimum read count threshold cutoff for reliable 3’ss splice ratios to use for downstream analysis. Then do some plotting. More precisely, I will plot Intron retenetion percent spliced in (Intron retention PSI) which will be \(IntronRetentionPSI=\frac{UpstreamCoverage}{DownstreamCoverage}\).

Cutoff <- 50

RS <- read.table("../output/3ssCoverageBeds/NA18862_argonne.bed.gz", sep='\t', col.names = c('chrom', 'start', 'stop', 'name', 'score', 'strand', 'upstream', 'downstream'), stringsAsFactors = F) %>% filter(upstream + downstream >= Cutoff)
nRS <- read.table("../output/3ssCoverageBeds/18862_cheRNA_1.bed.gz", sep='\t', col.names = c('chrom', 'start', 'stop', 'name', 'score', 'strand', 'upstream', 'downstream'), stringsAsFactors = F) %>% filter(upstream + downstream >= Cutoff)

ToPlot <- RS %>%
  left_join(nRS, by="name") %>%
  inner_join((Annotated %>% filter(Type3ss != "Unclassifiable")), by=c("name" = "Acceptor"))%>%
  mutate(RS.ratio = upstream.x/(downstream.x)) %>%
  mutate(nRS.ratio = upstream.y/(downstream.y)) %>%
  mutate(NormalizedRatio = nRS.ratio/RS.ratio)

# Plot Intron retenetion PSI.
ggplot(ToPlot, aes(color=Type3ss)) +
  stat_ecdf(aes(x=nRS.ratio, linetype="nascent-RNA"), geom = "step") +
  stat_ecdf(aes(x=RS.ratio, linetype="polyA-RNA"), geom = "step") +
  ylab('Cumulative fraction') +
  xlab("Intron retention PSI\nMore spliced<----------->More unspliced") +
  scale_x_continuous(trans='log10', limits=c(0.001,1)) +
  theme_bw()

Version	Author	Date
3d9467f	Benjmain Fair	2019-09-18
78712fa	Benjmain Fair	2019-09-16

Chromatin associated RNA has higher \(Intron Retention PSI\) as expected. Alternative upstream introns also have \(Intron Retention PSI\) as expected, since there should be less coverage downstream of those 3’ss for cassette exons. Now quantify cotrnanscriptional splicing with a metric defined as \(\frac{Intron retention PSI_{nascent}}{Intron retention PSI_{polyA}}\).

# Plot cotranscriptional splicing
ggplot(ToPlot, aes(x=NormalizedRatio, color=Type3ss)) +
  stat_ecdf(geom = "step") +
  ylab('Cumulative fraction') +
  xlab("Cotranscriptional splicing ratio\nMoreCoTxnSplicing<---------->LessConTxnSplicing") +
  scale_x_continuous(trans='log10') +
  theme_bw()

Version	Author	Date
3d9467f	Benjmain Fair	2019-09-18
4ea43f8	Benjmain Fair	2019-09-16

ToTest <- ToPlot %>%
  filter(Type3ss %in% c("Constitutive intron", "Downstream alternative intron"))
wilcox.test(NormalizedRatio ~ Type3ss, data=ToTest)


    Wilcoxon rank sum test with continuity correction

data:  NormalizedRatio by Type3ss
W = 459050, p-value = 8.632e-07
alternative hypothesis: true location shift is not equal to 0

As expected from Herzel et al, 3’ss downstream of cassette exons (which are an intron-centric measure of splicing of the upstream cassette exon) are spliced slower than constitutive introns.

Next I should see how replicates compare, to see if this statistic is reasonably stable (eventually I could consider QTL mapping using this statistic).

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.1     purrr_0.3.2    
[5] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.1.1  
[9] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       cellranger_1.1.0 plyr_1.8.4       pillar_1.4.1    
 [5] compiler_3.5.1   git2r_0.25.2     workflowr_1.4.0  tools_3.5.1     
 [9] digest_0.6.19    lubridate_1.7.4  jsonlite_1.6     evaluate_0.14   
[13] nlme_3.1-140     gtable_0.3.0     lattice_0.20-38  pkgconfig_2.0.2 
[17] rlang_0.3.4      cli_1.1.0        rstudioapi_0.10  yaml_2.2.0      
[21] haven_2.1.0      xfun_0.7         withr_2.1.2      xml2_1.2.0      
[25] httr_1.4.0       knitr_1.23       hms_0.4.2        generics_0.0.2  
[29] fs_1.3.1         rprojroot_1.3-2  grid_3.5.1       tidyselect_0.2.5
[33] glue_1.3.1       R6_2.4.0         readxl_1.3.1     rmarkdown_1.13  
[37] modelr_0.1.4     magrittr_1.5     whisker_0.3-2    backports_1.1.4 
[41] scales_1.0.0     htmltools_0.3.6  rvest_0.3.4      assertthat_0.2.1
[45] colorspace_1.4-1 labeling_0.3     stringi_1.4.3    lazyeval_0.2.2  
[49] munsell_0.5.0    broom_0.5.2      crayon_1.3.4

20190909_Count3ssRatioExample

Ben Fair

9/10/2019