Last updated: 2023-04-14
Checks: 6 1
Knit directory:
ChromatinSplicingQTLs/analysis/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git. To know which version of the R
Markdown file created these results, you’ll want to first commit it to
the Git repo. If you’re still working on the analysis, you can ignore
this warning. When you’re finished, you can run
wflow_publish
to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20191126)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 2ee57a6. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.Rhistory
Ignored: analysis/figure/
Ignored: code/.DS_Store
Ignored: code/.RData
Ignored: code/._report.html
Ignored: code/.ipynb_checkpoints/
Ignored: code/.snakemake/
Ignored: code/APA_Processing/
Ignored: code/Alignments/
Ignored: code/ChromHMM/
Ignored: code/ENCODE/
Ignored: code/ExpressionAnalysis/
Ignored: code/FastqFastp/
Ignored: code/FastqFastpSE/
Ignored: code/FastqSE/
Ignored: code/FineMapping/
Ignored: code/Genotypes/
Ignored: code/H3K36me3_CutAndTag.pdf
Ignored: code/IntronSlopes/
Ignored: code/LR.bed
Ignored: code/LR.seq.bed
Ignored: code/LongReads/
Ignored: code/Metaplots/
Ignored: code/Misc/
Ignored: code/MiscCountTables/
Ignored: code/Multiqc/
Ignored: code/Multiqc_chRNA/
Ignored: code/NonCodingRNA/
Ignored: code/NonCodingRNA_annotation/
Ignored: code/PairwisePi1Traits.P.all.txt.gz
Ignored: code/PeakCalling/
Ignored: code/Phenotypes/
Ignored: code/PlotGruberQTLs/
Ignored: code/PlotQTLs/
Ignored: code/ProCapAnalysis/
Ignored: code/QC/
Ignored: code/QTL_SNP_Enrichment/
Ignored: code/QTLs/
Ignored: code/RPKM_tables/
Ignored: code/ReadLengthMapExperiment/
Ignored: code/ReadLengthMapExperimentResults/
Ignored: code/ReadLengthMapExperimentSpliceCounts/
Ignored: code/ReferenceGenome/
Ignored: code/Rplots.pdf
Ignored: code/Session.vim
Ignored: code/SmallMolecule/
Ignored: code/SplicingAnalysis/
Ignored: code/TODO
Ignored: code/Tehranchi/
Ignored: code/bigwigs/
Ignored: code/bigwigs_FromNonWASPFilteredReads/
Ignored: code/config/.DS_Store
Ignored: code/config/._.DS_Store
Ignored: code/config/.ipynb_checkpoints/
Ignored: code/config/config.local.yaml
Ignored: code/dag.pdf
Ignored: code/dag.png
Ignored: code/dag.svg
Ignored: code/debug.ipynb
Ignored: code/debug_python.ipynb
Ignored: code/deepTools/
Ignored: code/featureCounts/
Ignored: code/featureCountsBasicGtf/
Ignored: code/gwas_summary_stats/
Ignored: code/hyprcoloc/
Ignored: code/igv_session.xml
Ignored: code/isoseqbams/
Ignored: code/log
Ignored: code/logs/
Ignored: code/notebooks/.ipynb_checkpoints/
Ignored: code/pi1/
Ignored: code/rules/.ProcessSmallMoleculeData.smk.swp
Ignored: code/rules/.ipynb_checkpoints/
Ignored: code/rules/OldRules/
Ignored: code/rules/notebooks/
Ignored: code/scratch/
Ignored: code/scripts/.ipynb_checkpoints/
Ignored: code/scripts/GTFtools_0.8.0/
Ignored: code/scripts/__pycache__/
Ignored: code/scripts/liftOverBedpe/liftOverBedpe.py
Ignored: code/snakemake.dryrun.log
Ignored: code/snakemake.log
Ignored: code/snakemake.sbatch.log
Ignored: code/snakemake_profiles/slurm/__pycache__/
Ignored: code/test.introns.bed
Ignored: code/test.introns2.bed
Ignored: code/test.log
Ignored: code/tracks.xml
Ignored: data/.DS_Store
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-10.csv
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-11.csv
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-2.csv
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-3.csv
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-4.csv
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-5.csv
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-6.csv
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-7.csv
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-8.csv
Ignored: data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022.csv
Ignored: data/Metaplots/.DS_Store
Untracked files:
Untracked: analysis/20230413_CheckSplicingTimingOfSM_sensitiveIntrons.Rmd
Untracked: code/envs/maxentscan.yml
Unstaged changes:
Modified: analysis/20230324_MakeFigs.Rmd
Modified: code/rules/ProcessSmallMoleculeData.smk
Modified: code/scripts/GenometracksByGenotype
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with
wflow_publish()
to start tracking its development.
knitr::opts_chunk$set(echo = TRUE, warning = F, message = F)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(RColorBrewer)
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
library(edgeR)
Loading required package: limma
# Set theme
theme_set(
theme_classic() +
theme(text=element_text(size=16, family="Helvetica")))
# I use layer a lot, to rotate long x-axis labels
Rotate_x_labels <- theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
#test plot
ggplot(mtcars, aes(x=mpg, y=cyl)) +
geom_point()
Yang hypothesizes that cryptic exons induced by risdiplam will be in host introns that are spliced slower than typical introns. The idea is that the small molecule needs time to have its effect, and it is in competition with the splicing of the canonical host introns. To test this, I will assess the splicing timing of introns using the ratio of coverage in 25nt window upstream of 3’ss, versus 25 nt window downstream of 3’ss for each intron. I hypothesize that introns that host a risdiplam-induced cryptic exon will have higher upstream/downstream ratio than random set of introns. I will specifically look at the DMSO control samples of chRNA.
UpstreamCoverage <- read_tsv("../code/SmallMolecule/3ssWindows/Upstream.counts.txt") %>%
rename_at(-c(1:6), ~str_replace(.x, "SmallMolecule/AlignmentsPass2/(.+?)/Aligned.sortedByCoord.out.bam", "\\1")) %>%
dplyr::select(-c(1:3), -5, -6) %>%
column_to_rownames("name") %>%
as.matrix()
DownstreamCoverage <- read_tsv("../code/SmallMolecule/3ssWindows/Down.counts.txt") %>%
rename_at(-c(1:6), ~str_replace(.x, "SmallMolecule/AlignmentsPass2/(.+?)/Aligned.sortedByCoord.out.bam", "\\1")) %>%
dplyr::select(-c(1:3), -5, -6) %>%
column_to_rownames("name") %>%
as.matrix()
UpstreamCoverage.DMSO.chRNA <- UpstreamCoverage %>%
as.data.frame() %>%
dplyr::select(contains("DMSO") & contains("chRNA")) %>%
as.matrix()
DownstreamCoverage.DMSO.chRNA <- DownstreamCoverage %>%
as.data.frame() %>%
dplyr::select(contains("DMSO") & contains("chRNA")) %>%
as.matrix()
hist(log10(UpstreamCoverage + DownstreamCoverage + 1))
hist(log10(apply(UpstreamCoverage + DownstreamCoverage, 1, sum) + 1))
Let’s process samples an a per-sample basis, so we can use the replicates. Also, I will filter out 3’ss in each sample with less than 10 reads between the upstream and downstream window. Let’s add a small psueodcount to everything to not get divide by 0 errors.
RisdiplamSensitiveIntrons <- read_tsv("../output/SmallMoleculeGAGT_CassetteExonclusters.bed", col_names=c("chrom", "start", "stop", "name", "score", "strand", "thickStart", "thickEnd", "color")) %>%
filter(str_detect(name, "junc.skipping")) %>%
mutate(SpliceAcceptor = if_else(
strand == "+",
paste(chrom, stop, sep="_"),
paste(chrom, start, sep="_")
)) %>%
dplyr::select(GAGTIntron=name, name=SpliceAcceptor) %>%
mutate(GAGTIntron = str_replace(GAGTIntron, "junc.skipping_clu_.+?_(chr.+?)$", "\\1"))
Merged <- bind_rows(
DownstreamCoverage %>%
as.data.frame() %>%
dplyr::select(contains("DMSO") & contains("chRNA")) %>%
rownames_to_column("name") %>%
mutate(Window = "Down"),
UpstreamCoverage %>%
as.data.frame() %>%
dplyr::select(contains("DMSO") & contains("chRNA")) %>%
rownames_to_column("name") %>%
mutate(Window = "Up")
) %>%
pivot_longer(contains("DMSO"),names_to = "Sample") %>%
mutate(value = value + 0.1) %>%
pivot_wider(names_from="Window", values_from = "value") %>%
mutate(sum = Down + Up) %>%
filter(sum >= 10) %>%
mutate(Ratio = Up/Down) %>%
left_join(RisdiplamSensitiveIntrons) %>%
mutate(Group = if_else(is.na(GAGTIntron), "No SM-induced exon", "Contains SM-induced exon"))
Merged %>%
ggplot(aes(x=Ratio, color=Sample)) +
stat_ecdf() +
geom_vline(xintercept=1) +
scale_x_continuous(trans="log10") +
labs(y="ecdf")
Merged %>%
ggplot(aes(x=Ratio, color=Sample)) +
stat_ecdf(aes(linetype=Group)) +
geom_vline(xintercept=1) +
scale_x_continuous(trans="log10") +
labs(y="ecdf") +
coord_cartesian(c(1E-2, 1)) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing")
Merged %>%
ggplot(aes(x=Ratio, color=Sample)) +
stat_ecdf(aes(linetype=Group)) +
geom_vline(xintercept=1) +
labs(y="ecdf") +
coord_cartesian(c(1E-2, 1)) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing")
Yang suggested distinguishing introns by ED50…
ModelFits <- read_tsv("../code/SmallMolecule/FitModels/polyA_GAGTIntrons.tsv.gz")
Merged %>%
# filter(!is.na(GAGTIntron)) %>%
left_join(
ModelFits %>%
filter(param=="ED50:(Intercept)"),
by=c("GAGTIntron"="junc")
) %>%
mutate(ED50Group = cut_number(Estimate, 4)) %>%
ggplot(aes(x=Ratio, color=ED50Group)) +
stat_ecdf() +
geom_vline(xintercept=1) +
labs(y="ecdf") +
scale_x_continuous(trans="log10") +
coord_cartesian(c(1E-2, 1)) +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
facet_wrap(~Sample) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing")
Yang suggested matching introns by length…
Introns <- read_tsv("../code/SmallMolecule/FullSpliceSiteAnnotations/JuncfilesMerged.annotated.basic.bed.gz") %>%
filter(known_junction == 0) %>%
mutate(SpliceAcceptor = if_else(
strand == "+",
paste(chrom, end, sep="_"),
paste(chrom, start, sep="_")
)) %>%
add_count(SpliceAcceptor)
ggplot(Introns, aes(x=n)) +
stat_ecdf() +
coord_cartesian(x=c(0,10)) +
labs(x="Number donors associated with each annotated 3'ss", y="ecdf")
Merged %>%
left_join(
ModelFits %>%
filter(param=="ED50:(Intercept)"),
by=c("GAGTIntron"="junc")
) %>%
mutate(ED50Group = cut_number(Estimate, 4)) %>%
inner_join(Introns, by=c("name"="SpliceAcceptor")) %>%
mutate(Length = end - start) %>%
distinct(chrom, start, end, .keep_all=T) %>%
ggplot(aes(x=Length, color=ED50Group)) +
stat_ecdf() +
scale_x_continuous(trans="log10") +
coord_cartesian(xlim=c(100, 1E5)) +
labs(y="ecdf") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(x="Intron length by group")
Merged %>%
left_join(
ModelFits %>%
filter(param=="ED50:(Intercept)"),
by=c("GAGTIntron"="junc")
) %>%
inner_join(Introns, by=c("name"="SpliceAcceptor")) %>%
mutate(Length = end - start) %>%
mutate(LengthGroup = cut_number(Length, 5)) %>%
distinct(chrom, start, end, .keep_all=T) %>%
ggplot(aes(x=Ratio, color=LengthGroup)) +
stat_ecdf() +
scale_x_continuous(trans="log10") +
coord_cartesian(c(1E-1, 1)) +
labs(y="ecdf") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing", color="Intron length ntile")
Ok, so there is some confounding by intron length… Longer introns are spliced faster (which i find surprising anyway), and the small-molecule affected introns tend to be slightly longer. So if you correct for the length effect, either by comparing length-matched groups or fitting a model with length as covariate, perhaps we would see that the small molecule effected introns are in fact spliced slow.
lm.fit <- Merged %>%
inner_join(Introns, by=c("name"="SpliceAcceptor")) %>%
mutate(Dummy_1_ForIntronsWithCrypticExons = if_else(is.na(GAGTIntron), 0, 1)) %>%
mutate(Length = log10(end - start)) %>%
mutate(Response = log10(Ratio)) %>%
lm(Response ~ Length + Dummy_1_ForIntronsWithCrypticExons, data=.)
summary(lm.fit)
Call:
lm(formula = Response ~ Length + Dummy_1_ForIntronsWithCrypticExons,
data = .)
Residuals:
Min 1Q Median 3Q Max
-3.7029 -0.2219 0.0768 0.3164 2.9021
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.338914 0.002776 -122.100 < 2e-16 ***
Length -0.055253 0.000771 -71.662 < 2e-16 ***
Dummy_1_ForIntronsWithCrypticExons -0.025155 0.006006 -4.189 2.81e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4536 on 724009 degrees of freedom
Multiple R-squared: 0.007087, Adjusted R-squared: 0.007085
F-statistic: 2584 on 2 and 724009 DF, p-value: < 2.2e-16
library(MASS)
# library(sfsmisc)
rlm.fit <- Merged %>%
inner_join(Introns, by=c("name"="SpliceAcceptor")) %>%
mutate(Dummy_1_ForIntronsWithCrypticExons = if_else(is.na(GAGTIntron), 0, 1)) %>%
mutate(Length = log10(end - start)) %>%
mutate(Response = log10(Ratio)) %>%
rlm(Response ~ Length + Dummy_1_ForIntronsWithCrypticExons, data=.)
summary(rlm.fit)
Call: rlm(formula = Response ~ Length + Dummy_1_ForIntronsWithCrypticExons,
data = .)
Residuals:
Min 1Q Median 3Q Max
-3.74099 -0.26470 0.03432 0.27396 2.85295
Coefficients:
Value Std. Error t value
(Intercept) -0.3249 0.0024 -137.9842
Length -0.0471 0.0007 -72.0779
Dummy_1_ForIntronsWithCrypticExons -0.0152 0.0051 -2.9767
Residual standard error: 0.401 on 724009 degrees of freedom
Ok, so somewhat significant in a direction consistent with the hypothesis.. Wait, I feel I did that oddly. t makes more natural sense for the Dummy_1_ForIntronsWithCrypticExons to be a response…
lm.fit <- Merged %>%
inner_join(Introns, by=c("name"="SpliceAcceptor")) %>%
mutate(Dummy_1_ForIntronsWithCrypticExons = if_else(is.na(GAGTIntron), 0, 1)) %>%
mutate(Length = log10(end - start)) %>%
mutate(Response = log10(Ratio)) %>%
lm(Dummy_1_ForIntronsWithCrypticExons ~ Length + Response, data=.)
summary(lm.fit)
Call:
lm(formula = Dummy_1_ForIntronsWithCrypticExons ~ Length + Response,
data = .)
Residuals:
Min 1Q Median 3Q Max
-0.01593 -0.00932 -0.00801 -0.00657 0.99709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0019304 0.0005487 -3.518 0.000435 ***
Length 0.0026496 0.0001514 17.502 < 2e-16 ***
Response -0.0009633 0.0002300 -4.189 2.81e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08877 on 724009 degrees of freedom
Multiple R-squared: 0.0004674, Adjusted R-squared: 0.0004647
F-statistic: 169.3 on 2 and 724009 DF, p-value: < 2.2e-16
Ok, the sign of the effect and P-value are more or less the same…
Now Let’s do the length matching thing and plot as ecdf more more intuitive visualization of the effect…
#Get control introns as the introns just bigger or just smaller than the test introns in a list of introns ranked by length
Merged.tidyForMatching <- Merged %>%
left_join(
ModelFits %>%
filter(param=="ED50:(Intercept)"),
by=c("GAGTIntron"="junc")
) %>%
mutate(ED50Group = cut_number(Estimate, 4)) %>%
inner_join(Introns, by=c("name"="SpliceAcceptor")) %>%
mutate(Length = end - start) %>%
group_by(Sample) %>%
arrange(Length) %>%
mutate(LaggingIntronGroup = lag(Group)) %>%
mutate(LeadingIntronGroup = lead(Group)) %>%
ungroup()
Merged.WithLengthMatchedControlIntrons <-
bind_rows(
Merged.tidyForMatching %>%
filter(Group == "Contains SM-induced exon"),
Merged.tidyForMatching %>%
filter(!(Group == "Contains SM-induced exon") & LaggingIntronGroup =="Contains SM-induced exon"),
Merged.tidyForMatching %>%
filter(!(Group == "Contains SM-induced exon") & LeadingIntronGroup =="Contains SM-induced exon")
)
P <- Merged.WithLengthMatchedControlIntrons %>%
ggplot(aes(x=Length, color=Group)) +
stat_ecdf() +
scale_x_continuous(trans="log10") +
labs(y="ecdf", title="No SM-induced exons are length matched")
P
P +facet_wrap(~Group)
Merged.WithLengthMatchedControlIntrons %>%
ggplot(aes(x=Ratio, color=Sample)) +
stat_ecdf(aes(linetype=Group)) +
geom_vline(xintercept=1) +
scale_x_continuous(trans="log10") +
labs(y="ecdf") +
coord_cartesian(c(1E-1, 1)) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing")
Merged.WithLengthMatchedControlIntrons %>%
ggplot(aes(x=Ratio, color=Sample)) +
stat_ecdf(aes(linetype=Group)) +
geom_vline(xintercept=1) +
scale_x_continuous(trans="log10") +
labs(y="ecdf") +
coord_cartesian(c(1E-1, 1)) +
facet_wrap(~Sample) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing")
Merged.WithLengthMatchedControlIntrons %>%
ggplot(aes(x=Ratio, color=ED50Group)) +
stat_ecdf(aes(linetype=Group)) +
geom_vline(xintercept=1) +
scale_x_continuous(trans="log10") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf") +
coord_cartesian(c(1E-2, 1)) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing") +
facet_wrap(~Sample)
Ok but in the last plot I didn’t length match per ED50 group… let’s do that…
Merged.tidyForMatching <- Merged %>%
left_join(
ModelFits %>%
filter(param=="ED50:(Intercept)"),
by=c("GAGTIntron"="junc")
) %>%
mutate(ED50Group = cut_number(Estimate, 4)) %>%
inner_join(Introns, by=c("name"="SpliceAcceptor")) %>%
mutate(Length = end - start) %>%
group_by(Sample) %>%
arrange(Length) %>%
mutate(LaggingIntronGroup = lag(ED50Group)) %>%
mutate(LeadingIntronGroup = lead(ED50Group)) %>%
ungroup()
Merged.WithLengthMatchedControlIntrons <-
bind_rows(
Merged.tidyForMatching %>%
filter(Group == "Contains SM-induced exon"),
Merged.tidyForMatching %>%
filter(!Group == "Contains SM-induced exon" & !is.na(LaggingIntronGroup)) %>%
mutate(ED50Group = LaggingIntronGroup),
Merged.tidyForMatching %>%
filter(!Group == "Contains SM-induced exon" & !is.na(LeadingIntronGroup)) %>%
mutate(ED50Group = LeadingIntronGroup)
)
P <- Merged.WithLengthMatchedControlIntrons %>%
ggplot(aes(x=Length, color=ED50Group)) +
stat_ecdf(aes(linetype=Group)) +
scale_x_continuous(trans="log10") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf", title="No SM-induced exons are length matched")
P
P +facet_wrap(~Group)
Merged.WithLengthMatchedControlIntrons %>%
ggplot(aes(x=Ratio, color=ED50Group)) +
stat_ecdf(aes(linetype=Group)) +
geom_vline(xintercept=1) +
scale_x_continuous(trans="log10") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf") +
coord_cartesian(c(1E-2, 1)) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing") +
facet_wrap(~Sample)
Merged.WithLengthMatchedControlIntrons %>%
filter(ED50Group == "[20.4,843]") %>%
ggplot(aes(x=Ratio, color=ED50Group)) +
stat_ecdf(aes(linetype=Group)) +
geom_vline(xintercept=1) +
scale_x_continuous(trans="log10") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf") +
coord_cartesian(c(1E-2, 1)) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing") +
facet_wrap(~Sample)
Overall not so convincing…
One other thing Yang asked for some of the same analysis but looking for correlations with splice site scores to explain where the crpyic exons are… Like do the cryptic exons have particular strong upstream 3’ss? I’m not sure why it would make sense to look at any other splice site. Though, while I’m scoring splice sites I mine as well also just verify a previous observation that introns with weak splice sites are spliced slower, as a positive control that I am doing these analysis correctly.
ThreeSS.Scores <- read_tsv("../code/SmallMolecule/MaxEntScan/3ss.tsv.gz", col_names=c("Seq", "Score"))
ThreeSS.Seqs <- read_tsv("../code/SmallMolecule/MaxEntScan/3ss.seqs.bed.gz", col_names=c("chrom", "start", "stop", "AcceptorName", "score", "strand", "Seq")) %>%
inner_join(ThreeSS.Scores) %>%
dplyr::select(AcceptorName, Seq, Score) %>%
distinct()
Merged %>%
inner_join(ThreeSS.Seqs, by=c("name"="AcceptorName")) %>%
mutate(AcceptorScoreGroup = cut_number(Score, 4)) %>%
ggplot(aes(x=Ratio, color=AcceptorScoreGroup)) +
stat_ecdf() +
geom_vline(xintercept=1) +
scale_x_continuous(trans="log10") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf") +
coord_cartesian(c(1E-2, 1)) +
labs(x="Upstream/Downstream coverage at 3'ss\nFastSplicing<-->SlowSplicing") +
facet_wrap(~Sample)
Ok, there seems to be a sensible effect wherein 3’ss in the worst quartile of MaxEntScores have slower splicing…
Now let’s check the scores of the downstream 3’ss of host introns with or without the induced cryptic exons…
Merged %>%
inner_join(ThreeSS.Seqs, by=c("name"="AcceptorName")) %>%
distinct(name, .keep_all=T) %>%
ggplot(aes(x=Score, color=Group)) +
stat_ecdf(aes()) +
# geom_vline(xintercept=1) +
# scale_x_continuous(trans="log10") +
# scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf") +
coord_cartesian(c(0, 15)) +
labs(x="3ss MaxEntScore")
Merged %>%
left_join(
ModelFits %>%
filter(param=="ED50:(Intercept)"),
by=c("GAGTIntron"="junc")
) %>%
mutate(ED50Group = cut_number(Estimate, 4)) %>%
inner_join(ThreeSS.Seqs, by=c("name"="AcceptorName")) %>%
distinct(name, .keep_all=T) %>%
ggplot(aes(x=Score, color=ED50Group)) +
stat_ecdf(aes(linetype=Group)) +
# geom_vline(xintercept=1) +
# scale_x_continuous(trans="log10") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf") +
coord_cartesian(c(0, 15)) +
labs(x="3ss MaxEntScore")
Ok, now let’s check for a correlation between MaxEntScore and intron length…
inner_join(Introns, ThreeSS.Seqs, by=c("SpliceAcceptor"="AcceptorName")) %>%
mutate(Length = end - start) %>%
mutate(LengthGroup = cut_number(Length, 5)) %>%
ggplot(aes(x=Score, color=LengthGroup)) +
stat_ecdf() +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf") +
coord_cartesian(c(0, 15)) +
labs(x="3ss MaxEntScore")
inner_join(Introns, ThreeSS.Seqs, by=c("SpliceAcceptor"="AcceptorName")) %>%
mutate(Length = end - start) %>%
ggplot(aes(y=Score, x=log10(Length))) +
geom_hex(bins=100) +
geom_smooth(method='lm') +
scale_x_continuous(trans='log10') +
labs(x="IntronLength", y="3ss MaxEntScore")
Hmmm, ok maybe there is some odd relationship with 3’ss MaxEntScore and length.. In any case, now let’s compare to length-matched control introns just in case there is correlation between length and 3ss MaxEntScore.
Merged.tidyForMatching <- Merged %>%
left_join(
ModelFits %>%
filter(param=="ED50:(Intercept)"),
by=c("GAGTIntron"="junc")
) %>%
mutate(ED50Group = cut_number(Estimate, 4)) %>%
inner_join(Introns, by=c("name"="SpliceAcceptor")) %>%
mutate(Length = end - start) %>%
group_by(Sample) %>%
arrange(Length) %>%
mutate(LaggingIntronGroup = lag(ED50Group)) %>%
mutate(LeadingIntronGroup = lead(ED50Group)) %>%
ungroup()
Merged.WithLengthMatchedControlIntrons <-
bind_rows(
Merged.tidyForMatching %>%
filter(Group == "Contains SM-induced exon"),
Merged.tidyForMatching %>%
filter(!Group == "Contains SM-induced exon" & !is.na(LaggingIntronGroup)) %>%
mutate(ED50Group = LaggingIntronGroup),
Merged.tidyForMatching %>%
filter(!Group == "Contains SM-induced exon" & !is.na(LeadingIntronGroup)) %>%
mutate(ED50Group = LeadingIntronGroup)
) %>%
inner_join(ThreeSS.Seqs, by=c("name"="AcceptorName")) %>%
distinct(name, .keep_all=T)
P <- Merged.WithLengthMatchedControlIntrons %>%
ggplot(aes(x=Length, color=ED50Group)) +
stat_ecdf(aes(linetype=Group)) +
scale_x_continuous(trans="log10") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf", title="No SM-induced exons are length matched")
P
P + facet_wrap(~Group)
Merged.WithLengthMatchedControlIntrons %>%
ggplot(aes(x=Score, color=ED50Group)) +
stat_ecdf(aes(linetype=Group)) +
# geom_vline(xintercept=1) +
# scale_x_continuous(trans="log10") +
scale_color_brewer(palette = "YlGnBu", na.value="black", direction=-1) +
labs(y="ecdf") +
coord_cartesian(c(0, 15)) +
labs(x="3ss MaxEntScore") +
facet_wrap(~ED50Group)
Ok, nothing obvious here. But since Yang seemed to be satisfied enough to use the lm model results for assessing relationship between 3ss coverage and host intron having cryptic exon, let’s do the same sort of linear model with the DummyVariable as the response, and Length and MaxEntScore as predictors…
lm.fit <- Merged %>%
inner_join(ThreeSS.Seqs, by=c("name"="AcceptorName")) %>%
inner_join(Introns, by=c("name"="SpliceAcceptor")) %>%
distinct(name, .keep_all=T) %>%
mutate(Dummy_1_ForIntronsWithCrypticExons = if_else(is.na(GAGTIntron), 0, 1)) %>%
mutate(Length = log10(end - start)) %>%
lm(Dummy_1_ForIntronsWithCrypticExons ~ Length + Score, data=.)
summary(lm.fit)
Call:
lm(formula = Dummy_1_ForIntronsWithCrypticExons ~ Length + Score,
data = .)
Residuals:
Min 1Q Median 3Q Max
-0.00522 -0.00400 -0.00374 -0.00345 0.99787
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.625e-03 1.133e-03 1.435 0.1514
Length 2.366e-04 2.768e-04 0.855 0.3927
Score 1.467e-04 7.274e-05 2.017 0.0436 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06078 on 97084 degrees of freedom
Multiple R-squared: 5.216e-05, Adjusted R-squared: 3.156e-05
F-statistic: 2.532 on 2 and 97084 DF, p-value: 0.07949
Ok.. Pval is 0.0436… I don’t think we shold take anything from this…
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /software/openblas-0.3.13-el7-x86_64/lib/libopenblas_haswellp-r0.3.13.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C
[4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MASS_7.3-56 edgeR_3.38.4 limma_3.52.4 data.table_1.14.2
[5] RColorBrewer_1.1-3 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
[9] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7
[13] ggplot2_3.3.6 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] httr_1.4.3 sass_0.4.1 splines_4.2.0 bit64_4.0.5
[5] vroom_1.5.7 jsonlite_1.8.0 modelr_0.1.8 bslib_0.3.1
[9] assertthat_0.2.1 highr_0.9 cellranger_1.1.0 yaml_2.3.5
[13] pillar_1.7.0 backports_1.4.1 lattice_0.20-45 glue_1.6.2
[17] digest_0.6.29 promises_1.2.0.1 rvest_1.0.2 colorspace_2.0-3
[21] Matrix_1.5-3 htmltools_0.5.2 httpuv_1.6.5 pkgconfig_2.0.3
[25] broom_0.8.0 haven_2.5.0 scales_1.2.0 later_1.3.0
[29] tzdb_0.3.0 git2r_0.30.1 mgcv_1.8-40 generics_0.1.2
[33] farver_2.1.0 ellipsis_0.3.2 withr_2.5.0 hexbin_1.28.3
[37] cli_3.3.0 magrittr_2.0.3 crayon_1.5.1 readxl_1.4.0
[41] evaluate_0.15 fs_1.5.2 fansi_1.0.3 nlme_3.1-157
[45] xml2_1.3.3 tools_4.2.0 hms_1.1.1 lifecycle_1.0.1
[49] munsell_0.5.0 reprex_2.0.1 locfit_1.5-9.7 compiler_4.2.0
[53] jquerylib_0.1.4 rlang_1.0.2 grid_4.2.0 rstudioapi_0.13
[57] labeling_0.4.2 rmarkdown_2.14 gtable_0.3.0 DBI_1.1.2
[61] R6_2.5.1 lubridate_1.8.0 knitr_1.39 fastmap_1.1.0
[65] bit_4.0.4 utf8_1.2.2 workflowr_1.7.0 rprojroot_2.0.3
[69] stringi_1.7.6 parallel_4.2.0 Rcpp_1.0.8.3 vctrs_0.4.1
[73] dbplyr_2.1.1 tidyselect_1.1.2 xfun_0.30