ResponseToReviewer

Last updated: 2020-09-23

Checks: 6 1

Knit directory: Comparative_eQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.5.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190319)

The command set.seed(20190319) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 52f85bb

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    WorkingManuscript.zip
    Ignored:    WorkingManuscript/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    analysis_temp/.DS_Store
    Ignored:    big_data/
    Ignored:    code/.DS_Store
    Ignored:    code/snakemake_workflow/.DS_Store
    Ignored:    code/snakemake_workflow/.Rhistory
    Ignored:    data/.DS_Store
    Ignored:    data/PastAnalysesDataToKeep/.DS_Store
    Ignored:    figures/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  analysis/20200907_Response_Point_02.Rmd
    Untracked:  analysis/20200907_Response_Point_04.Rmd
    Untracked:  data/c5.all.v7.1.symbols.gmt
    Untracked:  data/c5.all.v7.1.symbols.gmt.categories.tsv.gz
    Untracked:  data/h.all.v7.1.symbols.gmt

Unstaged changes:
    Modified:   analysis/20200907_Response_OriginalComments.Rmd
    Modified:   analysis/20200907_Response_Point_06.Rmd
    Modified:   analysis/20200907_Response_Point_09-2.Rmd
    Modified:   analysis/20200907_Response_Point_09.Rmd
    Modified:   analysis/20200907_Response_Point_11.Rmd
    Modified:   analysis/Final_2_DispersionPlots.Rmd
    Modified:   analysis_temp/TabulaMuris_analysis2.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

Overview

Original reviewer point:

The second test is to correlate the higher coding sequence conservation with lower dispersion. Again, the positive result is not unexpected. There are many indirect and/or confounding factors that may explain the effect. This reviewer, however, understands it is impossible to control them all (also authors have attempted to address some of them in the next few tests). However, here it is better to add exploratory analyses for genes in different functional groups and also give examples of outlier genes that do not follow the rule.

This proposed analysis was difficult for me to come up with a way to enact. One idea is to pick each a set of gene annotation groups (which are often overlapping, complicating the matter), and then perform the dn/ds vs dispersion correlation test for each gene group. The most appropriate gene set annotations I found to use for this purpose is the MSigDB Hallmark gene collection, which contains 50 gene sets, each with mostly non-overlapping sets of genes no greater than 200 per set.

Another approach, which I will also enact below, is to perform a standard GSEA analysis using the residual of the genome-wide dn/ds vs dispersion trend. In other words, what gene sets have higher dispersion than you would expect given its dn/ds.

Analysis

First, load necessary libraries… and read in data

library(tidyverse)
library(knitr)
library(stats)
library("clusterProfiler")
library("org.Hs.eg.db")
library(broom)
library(qvalue)
library(data.table)
library(Cairo)
library(MASS)

#read in table of overdispersion, dispersion, mean expresion estimates
Dispersion <- read.delim('../output/OverdispersionEstimatesFromChimp.txt')

Dispersion$MeanDispersion <- (Dispersion$Chimp.Residual + Dispersion$Human.Residual)/2

#Read in table of dn/ds estimates
PanMammal.dnds <- read.delim("../data/Overall_dN_dS.bed", col.names = c("chrom", 'start', "stop", "gene", "score", "strand", "blockstart", "blockstop", "color", "ensemblprotein", "dn.ds"))

#Gene name symbols
Symbols <- read.delim("../data/HumanGeneIdToHGNC_Symbol.Biomart.txt.gz", stringsAsFactors = F) %>% dplyr::select(ensembl_gene_id=Gene.stable.ID, hgnc_symbol=HGNC.symbol)

MergedTable <- left_join(Dispersion, Symbols, by=c("gene"="ensembl_gene_id")) %>% 
  left_join(PanMammal.dnds, by=c("hgnc_symbol"="gene")) %>%
  dplyr::select(gene, Chimp.Residual, Human.Residual, MeanDispersion, dn.ds) %>%
  drop_na() %>%
  filter(dn.ds>0) #drop things that will cause error when taking log.

head(MergedTable) %>% kable()

gene	Chimp.Residual	Human.Residual	MeanDispersion	dn.ds
ENSG00000186891	2.3121799	0.0805422	1.1963611	0.472922
ENSG00000186827	1.6405781	1.9529450	1.7967615	0.279429
ENSG00000078808	-0.5386575	-0.8183559	-0.6785067	0.076499
ENSG00000176022	-0.2006514	-0.6565939	-0.4286227	0.048958
ENSG00000160087	-0.4256215	-0.7154168	-0.5705192	0.107057
ENSG00000169972	-0.4701338	0.4063494	-0.0318922	0.157678

GmtFile <- "/Users/benfair/Downloads/c5.all.v7.1.symbols.gmt"

NumFields <- max(count.fields(GmtFile, sep = '\t'))
HallmarkGeneSets <- read.table(GmtFile, header = FALSE, sep = "\t", 
  col.names = c("GeneSetName", "link", paste0("V",seq_len(NumFields-2))), fill = TRUE, stringsAsFactors = F, na.strings = "") %>%
  dplyr::select(-link) %>%
  gather(key="NumGeneInSet", value="gene", -GeneSetName) %>%
  drop_na() %>%
  arrange(GeneSetName, NumGeneInSet) %>%
  dplyr::select(-NumGeneInSet) %>%
  inner_join(Symbols, by=c("gene"="hgnc_symbol"))

head(HallmarkGeneSets) %>% kable()

GeneSetName	gene	ensembl_gene_id
GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACTIVITY	TYMP	ENSG00000025708
GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACTIVITY	GDPGP1	ENSG00000183208
GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACTIVITY	MTAP	ENSG00000099810
GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACTIVITY	PYGB	ENSG00000100994
GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACTIVITY	PYGL	ENSG00000100504
GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACTIVITY	PYGM	ENSG00000068976

Now I will get the dn/ds vs dispersion correlation coefficient and p value for each gene set.

Cor.twosidedtest <- HallmarkGeneSets %>%
  left_join(MergedTable, by=c("ensembl_gene_id"="gene")) %>%
  drop_na() %>%
  add_count(GeneSetName) %>%
  filter(n>=5) %>%
  group_by(GeneSetName) %>%
  do(tidy(cor.test(.$dn.ds, .$MeanDispersion, method = "spearman", exact = T))) %>%
  filter(p.value>0) #filter out buggy cases because of ties in spearman cor test

Cor.twosidedtest %>% head() %>% kable()

GeneSetName	estimate	statistic	p.value	method	alternative
GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACTIVITY	0.1000000	18	0.9500000	Spearman’s rank correlation rho	two.sided
GO_1_PHOSPHATIDYLINOSITOL_3_KINASE_ACTIVITY	-0.1000000	22	0.9500000	Spearman’s rank correlation rho	two.sided
GO_1_PHOSPHATIDYLINOSITOL_3_KINASE_REGULATOR_ACTIVITY	0.0329670	440	0.9155316	Spearman’s rank correlation rho	two.sided
GO_1_PHOSPHATIDYLINOSITOL_BINDING	0.2142857	44	0.6615079	Spearman’s rank correlation rho	two.sided
GO_14_3_3_PROTEIN_BINDING	-0.0979021	314	0.7662885	Spearman’s rank correlation rho	two.sided
GO_2_IRON_2_SULFUR_CLUSTER_BINDING	-0.3901099	506	0.1887717	Spearman’s rank correlation rho	two.sided

hist(Cor.twosidedtest$estimate)

hist(Cor.twosidedtest$p.value)

Given how a lot of these tests are not independent (due to overlapping gene sets), I wasn’t sure what to expect in terms of well calibrated P-values. This histogram is a bit comforting. Let’s identify genes at some FDR threshold with Storey’s q-value method, and see what the strongest gene categories that drive this correlation are.

Cor.twosidedtest$q <- qvalue(Cor.twosidedtest$p.value)$qvalues

#How many sets FDR<0.1
(Cor.twosidedtest$q < 0.1 ) %>% table()

.
FALSE  TRUE 
 6169   614

#Are these significant sets basically just the really big sets with lots of power?
Cor.twosidedtest <- Cor.twosidedtest %>%
  mutate(Sig=q<0.1) %>%
  left_join(
    HallmarkGeneSets %>% left_join(MergedTable, by=c("ensembl_gene_id"="gene")) %>%
      drop_na() %>%
      count(GeneSetName) %>%
      filter(n>=5),
    by="GeneSetName"
  )
ggplot(Cor.twosidedtest, aes(x=n, color=Sig)) +
  geom_histogram() +
  theme_bw()

Ok that looks good. Let’s make a plot in a similar format to my other GO analyses, separated by BP, MF, or CC. To do this I need to read in different files to map which GO terms belong to which of those categories.

GO.categories <- read.table("../data/c5.all.v7.1.symbols.gmt.categories.tsv.gz", col.names = c("GeneSetName", "Category"), sep='\t') %>%
  mutate(Category=gsub("/Users/benfair/Downloads/c5.(.+?).v7.1.symbols.gmt",replacement = "\\1",Category
  ))


GO.Summary.plot <- Cor.twosidedtest %>% 
  left_join(GO.categories, by="GeneSetName") %>%
  dplyr::select(GeneSetName, SpearmansRho=estimate, p.value, p.adjust=q, setSize=n, ONTOLOGY=Category) %>%
  mutate(Polarization=sign(SpearmansRho)) %>%
  filter(!Polarization==0) %>%
  # filter(str_detect(Description, 'DNA')) %>%
  # group_by(ONTOLOGY, Polarization) %>%
  group_by(ONTOLOGY, Polarization) %>%
  arrange(p.adjust) %>%
  mutate(rowN=row_number()) %>%
  # top_n(n = 3, wt = -log(p.adjust)) %>%
  ungroup() %>%
  filter(rowN<=3) %>%
  filter(p.adjust<0.1) %>%
  mutate(GeneSetName=str_remove(GeneSetName, "^GO_")) %>%
  mutate(ONTOLOGY=recode(ONTOLOGY, `bp` = "Biological Process", `cc` = "Cellular Component", `mf`="Molecular Function")) %>%
  ggplot(aes(x=SpearmansRho, y=GeneSetName, color=p.adjust, size=setSize)) +
  geom_point() +
  scale_x_continuous(limits=c(-1,1), breaks=c(-1,0,1)) +
  facet_grid(ONTOLOGY~., scales = "free") +
  scale_colour_viridis_c(trans="log10", direction=-1, option="D", limits=c(1E-16, 0.1)) +
  # scale_colour_gradient(low="red", high="black") +
  facet_grid(ONTOLOGY~., scales = "free") +
  # xlab("Enrichment\nOverdispersedInHuman<-->OverdispersedInChimp") +
  xlab("Spearman's \u03C1") +
  scale_y_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) +
  labs(color = "Adjusted P-value") +
  geom_vline(xintercept = 0, linetype="dotted") +
  theme_bw() +
  theme(axis.text.y = element_text(size=6)) + 
  ylab(NULL)
GO.Summary.plot

I am skeptical of the negative associations. Consistent with the story I have been building, the strongest associations between dn.ds and dispersion are in immune related genes.

Let’s plot that correlation as a scatter plot, along with some randomly chosen category of similar size for comparison

#Find gene categories similar in size as comparison
Cor.twosidedtest %>% 
  filter(GeneSetName=="GO_REGULATION_OF_IMMUNE_SYSTEM_PROCESS") %>% kable()

GeneSetName	estimate	statistic	p.value	method	alternative	q	Sig	n
GO_REGULATION_OF_IMMUNE_SYSTEM_PROCESS	0.3600718	25295938	0	Spearman’s rank correlation rho	two.sided	0	TRUE	619

Cor.twosidedtest %>% 
  filter(n>579 & n< 649) %>% kable()

GeneSetName	estimate	statistic	p.value	method	alternative	q	Sig	n
GO_ADENYL_NUCLEOTIDE_BINDING	0.0722987	42070813	0.0658743	Spearman’s rank correlation rho	two.sided	0.3054643	FALSE	648
GO_CELL_ACTIVATION	0.2846160	24115765	0.0000000	Spearman’s rank correlation rho	two.sided	0.0000000	TRUE	587
GO_CELL_PROJECTION_ORGANIZATION	0.0423731	34130874	0.3009068	Spearman’s rank correlation rho	two.sided	0.6229631	FALSE	598
GO_CELLULAR_MACROMOLECULE_CATABOLIC_PROCESS	0.1653719	34617254	0.0000315	Spearman’s rank correlation rho	two.sided	0.0012964	TRUE	629
GO_CHROMOSOME_ORGANIZATION	0.1225912	28532078	0.0031047	Spearman’s rank correlation rho	two.sided	0.0456120	TRUE	580
GO_ENDOPLASMIC_RETICULUM	0.1391738	37963676	0.0004100	Spearman’s rank correlation rho	two.sided	0.0094061	TRUE	642
GO_INTRACELLULAR_PROTEIN_TRANSPORT	0.1844152	33345693	0.0000034	Spearman’s rank correlation rho	two.sided	0.0002150	TRUE	626
GO_LIPID_METABOLIC_PROCESS	0.0945334	39932380	0.0165768	Spearman’s rank correlation rho	two.sided	0.1385821	FALSE	642
GO_LOCOMOTION	0.2529728	29960922	0.0000000	Spearman’s rank correlation rho	two.sided	0.0000000	TRUE	622
GO_MOLECULAR_FUNCTION_REGULATOR	0.2613738	30928615	0.0000000	Spearman’s rank correlation rho	two.sided	0.0000000	TRUE	631
GO_NEGATIVE_REGULATION_OF_NUCLEOBASE_CONTAINING_COMPOUND_METABOLIC_PROCESS	0.0541471	37752527	0.1777842	Spearman’s rank correlation rho	two.sided	0.5020328	FALSE	621
GO_NEGATIVE_REGULATION_OF_SIGNALING	0.1903382	30931822	0.0000022	Spearman’s rank correlation rho	two.sided	0.0001482	TRUE	612
GO_NEUROGENESIS	0.1063929	31530649	0.0093411	Spearman’s rank correlation rho	two.sided	0.0976614	TRUE	596
GO_ORGANIC_ACID_METABOLIC_PROCESS	0.0489076	31897971	0.2371635	Spearman’s rank correlation rho	two.sided	0.5709386	FALSE	586
GO_PROTEIN_MODIFICATION_BY_SMALL_PROTEIN_CONJUGATION_OR_REMOVAL	0.1040228	31773678	0.0110116	Spearman’s rank correlation rho	two.sided	0.1087103	FALSE	597
GO_REGULATION_OF_CELL_CYCLE	0.1377691	28622634	0.0008434	Spearman’s rank correlation rho	two.sided	0.0171851	TRUE	584
GO_REGULATION_OF_CELL_POPULATION_PROLIFERATION	0.2116038	32544014	0.0000001	Spearman’s rank correlation rho	two.sided	0.0000100	TRUE	628
GO_REGULATION_OF_IMMUNE_SYSTEM_PROCESS	0.3600718	25295938	0.0000000	Spearman’s rank correlation rho	two.sided	0.0000000	TRUE	619
GO_REGULATION_OF_ORGANELLE_ORGANIZATION	0.1327102	29087378	0.0012930	Spearman’s rank correlation rho	two.sided	0.0238330	TRUE	586

#Check out top hits ranked by spearman estimate
Cor.twosidedtest %>%
  filter(q<0.1) %>%
  filter(n>=10) %>%
  arrange(desc(estimate)) %>% head(30) %>% kable()

GeneSetName	estimate	statistic	p.value	method	alternative	q	Sig	n
GO_LEUKOCYTE_TETHERING_OR_ROLLING	0.8727273	28	0.0009490	Spearman’s rank correlation rho	two.sided	0.0189130	TRUE	11
GO_INTEGRIN_ACTIVATION	0.8666667	22	0.0026814	Spearman’s rank correlation rho	two.sided	0.0410333	TRUE	10
GO_PHOSPHATIDIC_ACID_BINDING	0.8666667	22	0.0026814	Spearman’s rank correlation rho	two.sided	0.0410333	TRUE	10
GO_LEUKOCYTE_ADHESION_TO_VASCULAR_ENDOTHELIAL_CELL	0.8441176	106	0.0000193	Spearman’s rank correlation rho	two.sided	0.0008800	TRUE	16
GO_PROTEIN_LIPID_COMPLEX_ASSEMBLY	0.8303030	28	0.0055568	Spearman’s rank correlation rho	two.sided	0.0696243	TRUE	10
GO_POSITIVE_REGULATION_OF_STEROID_METABOLIC_PROCESS	0.8131868	68	0.0012389	Spearman’s rank correlation rho	two.sided	0.0232379	TRUE	13
GO_PROTEIN_HETEROTETRAMERIZATION	0.8090909	42	0.0044280	Spearman’s rank correlation rho	two.sided	0.0593275	TRUE	11
GO_POSITIVE_REGULATION_OF_OLIGODENDROCYTE_DIFFERENTIATION	0.8060606	32	0.0082356	Spearman’s rank correlation rho	two.sided	0.0900905	TRUE	10
GO_REGULATION_OF_MORPHOGENESIS_OF_A_BRANCHING_STRUCTURE	0.8060606	32	0.0082356	Spearman’s rank correlation rho	two.sided	0.0900905	TRUE	10
GO_NEGATIVE_REGULATION_OF_BLOOD_CIRCULATION	0.8021978	72	0.0016243	Spearman’s rank correlation rho	two.sided	0.0279310	TRUE	13
GO_PROTEIN_LIPID_COMPLEX_SUBUNIT_ORGANIZATION	0.7982456	230	0.0000555	Spearman’s rank correlation rho	two.sided	0.0019680	TRUE	19
GO_ANTIMICROBIAL_HUMORAL_IMMUNE_RESPONSE_MEDIATED_BY_ANTIMICROBIAL_PEPTIDE	0.7972028	58	0.0031613	Spearman’s rank correlation rho	two.sided	0.0458355	TRUE	12
GO_CLATHRIN_COATED_ENDOCYTIC_VESICLE_MEMBRANE	0.7972028	58	0.0031613	Spearman’s rank correlation rho	two.sided	0.0458355	TRUE	12
GO_POSITIVE_REGULATION_OF_INTERFERON_BETA_PRODUCTION	0.7929825	236	0.0000711	Spearman’s rank correlation rho	two.sided	0.0023625	TRUE	19
GO_AMYLOID_BETA_CLEARANCE	0.7928571	116	0.0006740	Spearman’s rank correlation rho	two.sided	0.0142340	TRUE	15
GO_POSITIVE_REGULATION_OF_PHOSPHOLIPID_METABOLIC_PROCESS	0.7909091	46	0.0060608	Spearman’s rank correlation rho	two.sided	0.0734852	TRUE	11
GO_REGULATION_OF_LYMPHOCYTE_CHEMOTAXIS	0.7909091	46	0.0060608	Spearman’s rank correlation rho	two.sided	0.0734852	TRUE	11
GO_POSITIVE_REGULATION_OF_GLYCOPROTEIN_BIOSYNTHETIC_PROCESS	0.7818182	48	0.0070121	Spearman’s rank correlation rho	two.sided	0.0799596	TRUE	11
GO_POSITIVE_REGULATION_OF_GLYCOPROTEIN_METABOLIC_PROCESS	0.7818182	48	0.0070121	Spearman’s rank correlation rho	two.sided	0.0799596	TRUE	11
GO_ACTIVATION_OF_PROTEIN_KINASE_B_ACTIVITY	0.7735294	154	0.0006830	Spearman’s rank correlation rho	two.sided	0.0143763	TRUE	16
GO_RESPONSE_TO_CHEMOKINE	0.7729323	302	0.0000964	Spearman’s rank correlation rho	two.sided	0.0029011	TRUE	20
GO_RESPONSE_TO_INTERFERON_ALPHA	0.7622378	68	0.0058975	Spearman’s rank correlation rho	two.sided	0.0724690	TRUE	12
GO_REGULATION_OF_MYELINATION	0.7558824	166	0.0010695	Spearman’s rank correlation rho	two.sided	0.0209449	TRUE	16
GO_RUFFLE_ASSEMBLY	0.7529412	168	0.0011468	Spearman’s rank correlation rho	two.sided	0.0222990	TRUE	16
GO_POSITIVE_REGULATION_OF_RECEPTOR_MEDIATED_ENDOCYTOSIS	0.7416530	944	0.0000116	Spearman’s rank correlation rho	two.sided	0.0005818	TRUE	28
GO_RESPONSE_TO_INTERFERON_BETA	0.7362637	120	0.0038188	Spearman’s rank correlation rho	two.sided	0.0523761	TRUE	14
GO_PORE_COMPLEX	0.7362637	96	0.0057894	Spearman’s rank correlation rho	two.sided	0.0715548	TRUE	13
GO_POSITIVE_REGULATION_OF_CELL_CYCLE_G1_S_PHASE_TRANSITION	0.7357143	148	0.0025498	Spearman’s rank correlation rho	two.sided	0.0396516	TRUE	15
GO_SERINE_TYPE_ENDOPEPTIDASE_INHIBITOR_ACTIVITY	0.7357143	148	0.0025498	Spearman’s rank correlation rho	two.sided	0.0396516	TRUE	15
GO_REGULATION_OF_PROTEIN_OLIGOMERIZATION	0.7087912	106	0.0088286	Spearman’s rank correlation rho	two.sided	0.0939976	TRUE	13

#grep for MHC
Cor.twosidedtest %>%
  # filter(q<0.1) %>%
  filter(str_detect(GeneSetName, "MHC")) %>% kable()

GeneSetName	estimate	statistic	p.value	method	alternative	q	Sig	n
GO_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_PEPTIDE_ANTIGEN_VIA_MHC_CLASS_I	0.4606229	8746	0.0014298	Spearman’s rank correlation rho	two.sided	0.0254716	TRUE	46
GO_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_PEPTIDE_ANTIGEN_VIA_MHC_CLASS_I	0.4183149	18910	0.0011925	Spearman’s rank correlation rho	two.sided	0.0229066	TRUE	58
GO_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_PEPTIDE_OR_POLYSACCHARIDE_ANTIGEN_VIA_MHC_CLASS_II	0.2622974	10468	0.0855649	Spearman’s rank correlation rho	two.sided	0.3471221	FALSE	44
GO_MHC_CLASS_I_PROTEIN_BINDING	0.8285714	6	0.0583333	Spearman’s rank correlation rho	two.sided	0.2834249	FALSE	6
GO_MHC_CLASS_II_BIOSYNTHETIC_PROCESS	0.1785714	46	0.7130952	Spearman’s rank correlation rho	two.sided	0.8503662	FALSE	7
GO_MHC_CLASS_II_PROTEIN_COMPLEX_BINDING	0.4285714	32	0.3535714	Spearman’s rank correlation rho	two.sided	0.6581377	FALSE	7
GO_MHC_PROTEIN_BINDING	0.5181818	106	0.1069182	Spearman’s rank correlation rho	two.sided	0.3915911	FALSE	11
GO_MHC_PROTEIN_COMPLEX_BINDING	0.1272727	144	0.7328868	Spearman’s rank correlation rho	two.sided	0.8641192	FALSE	10

Ok, let’s plot GO_REGULATION_OF_IMMUNE_SYSTEM_PROCESS and GO_ENDOPLASMIC_RETICULUM as illustrative examples.

GeneSetsToPlot <- c("GO_REGULATION_OF_IMMUNE_SYSTEM_PROCESS", "GO_ORGANIC_ACID_METABOLIC_PROCESS", "GO_ANTIGEN_PROCESSING_AND_PRESENTATION_OF_EXOGENOUS_PEPTIDE_ANTIGEN_VIA_MHC_CLASS_I")

Labels <- Cor.twosidedtest %>%
  filter(GeneSetName %in% GeneSetsToPlot) %>%
  mutate(label=paste0(
    "Spearman's \u03C1 = ", signif(estimate,2), "\n",
    "n = ", n, "\n",
    "p = ", format.pval(p.value,2))) %>%
  dplyr::select(GeneSetName, label) %>%
  mutate(GeneSetNameMod=str_replace_all(str_replace(GeneSetName, "^GO", ""), pattern = "_", " "))
  
CorScatterPlots <- HallmarkGeneSets %>%
  left_join(MergedTable, by=c("ensembl_gene_id"="gene")) %>%
  filter(GeneSetName %in% GeneSetsToPlot) %>%
  drop_na() %>%
  mutate(GeneSetNameMod=str_replace_all(str_replace(GeneSetName, "^GO", ""), pattern = "_", " ")) %>%
ggplot(aes(x=dn.ds, y=MeanDispersion)) +
  geom_point(alpha=0.25) +
  scale_x_continuous(trans="log10", limits=c(.001,2)) +
  scale_y_continuous(limits=c(-3,3)) +
  facet_grid(~GeneSetNameMod, labeller = labeller(GeneSetNameMod = label_wrap_gen(width = 25))) +
  xlab("dN/dS") +
  ylab("Dispersion") +
  geom_text(
  data    = Labels,
  mapping = aes(x = 0.001, y = Inf, label = label),
  hjust   = 0,
  vjust   = 1.1) +
  theme_bw()
CorScatterPlots

ggsave("../figures/OriginalArt/ResponseToReviewers.dnds.GO.pdf", plot = CorScatterPlots, device=cairo_pdf, height=3, width=7)

Ok, now save the relevant plots…

ggsave("../figures/OriginalArt/ResponseToReviewers.dnds.GO.pdf", plot = CorScatterPlots, device=cairo_pdf, height=3, width=7)
ggsave("../figures/OriginalArt/ResponseToReviewers.GO.summary.pdf", plot = GO.Summary.plot, device=cairo_pdf, height=4, width=5.5)

GO.test.histogram <- ggplot(Cor.twosidedtest, aes(x=p.value)) +
  geom_histogram(bins=20) +
  theme_bw() +
  ylab("Number GO categories") +
  xlab("Spearman test P values")
ggsave("../figures/OriginalArt/ResponseToReviewers.GO.hist.pdf", plot = GO.test.histogram, height=3, width=3)
write_delim(Cor.twosidedtest, "../figures/OriginalArt/ResponseToReviewers.GO.hist.sorce.tsv", delim = '\t')

Ok, I think I’ve done enough for this approach.

Now let’s try the second idea: plot a regression fit of MeanDispersion (mean dispersion estimate between human and chimp) vs the log(dn/ds) (log transformation since dn/ds is more normally distributed on a log scale, which seems more natural). Here I am using the robust regression to not be thrown off so much for outlier points. The idea is to ask if certain gene categories have more dispersion than you would expect given their coding conservation.

EDIT: later I realized this approach isn’t as interesting as I thought, I’m going to make eval=F to build the site quicker, rather than have R evaluate these code blocks and make the plot results for this next section.

ggplot(MergedTable, aes(y=MeanDispersion, x=log(dn.ds))) +
  geom_point() +
  geom_smooth(method="rlm") +
  theme_bw()


fit <- rlm(MeanDispersion~log(dn.ds), data=MergedTable )
plot(fit)

MergedTable$fitResid <-  resid(fit)

ResidualsFromFit <- MergedTable %>%
  dplyr::select(gene, fitResid) %>%
  deframe() %>% sort(decreasing = T)

head(ResidualsFromFit)

gsea.resid<-gseGO(gene=ResidualsFromFit,
            ont = "ALL",
            OrgDb=org.Hs.eg.db,
            keyType='ENSEMBL',
            nPerm=100000)


ResidDispersionContrastPlot <-
  gsea.resid %>% as.data.frame() %>%
  dplyr::select(Description, ONTOLOGY, p.adjust,enrichmentScore, setSize, NES) %>%
  mutate(Polarization=sign(enrichmentScore)) %>%
  # filter(str_detect(Description, 'DNA')) %>%
  # group_by(ONTOLOGY, Polarization) %>%
  group_by(ONTOLOGY, Polarization) %>%
  top_n(n = 3, wt = abs(NES)) %>%
  ungroup() %>%
  mutate(ONTOLOGY=recode(ONTOLOGY, `BP` = "Biological Process", `CC` = "Cellular Component", `MF`="Molecular Function")) %>%
  ggplot(aes(x=enrichmentScore, y=Description, color=p.adjust, size=setSize)) +
  geom_point() +
  scale_x_continuous(limits=c(-1,1), breaks=c(-1,0,1)) +
  facet_grid(ONTOLOGY~., scales = "free") +
  scale_colour_viridis_c(trans="log10", limits=c(0.0001, 0.1), direction=-1, option="D") +
  # scale_colour_gradient(low="red", high="black") +
  facet_grid(ONTOLOGY~., scales = "free") +
  # xlab("Enrichment\nOverdispersedInHuman<-->OverdispersedInChimp") +
  xlab("Enrichment") +
  scale_y_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) +
  labs(color = "Adjusted P-value") +
  geom_vline(xintercept = 0, linetype="dotted") +
  theme_bw() +
  theme(axis.text.y = element_text(size=6)) + 
  ylab(NULL)
ResidDispersionContrastPlot

To contrast, let’s do the same thing on the original dispersion estimates

MeanDispersion <- MergedTable %>%
  dplyr::select(gene, MeanDispersion) %>%
  deframe() %>% sort(decreasing = T)

gsea.meanDispersion<-gseGO(gene=MeanDispersion,
            ont = "ALL",
            OrgDb=org.Hs.eg.db,
            keyType='ENSEMBL',
            nPerm=100000)

MeanDispersionContrastPlot <-
  gsea.meanDispersion %>% as.data.frame() %>%
  dplyr::select(Description, ONTOLOGY, p.adjust,enrichmentScore, setSize, NES) %>%
  mutate(Polarization=sign(enrichmentScore)) %>%
  # filter(str_detect(Description, 'DNA')) %>%
  # group_by(ONTOLOGY, Polarization) %>%
  group_by(ONTOLOGY, Polarization) %>%
  top_n(n = 3, wt = abs(NES)) %>%
  ungroup() %>%
  mutate(ONTOLOGY=recode(ONTOLOGY, `BP` = "Biological Process", `CC` = "Cellular Component", `MF`="Molecular Function")) %>%
  ggplot(aes(x=enrichmentScore, y=Description, color=p.adjust, size=setSize)) +
  geom_point() +
  scale_x_continuous(limits=c(-1,1), breaks=c(-1,0,1)) +
  facet_grid(ONTOLOGY~., scales = "free") +
  scale_colour_viridis_c(trans="log10", limits=c(0.0001, 0.1), direction=-1, option="D") +
  # scale_colour_gradient(low="red", high="black") +
  facet_grid(ONTOLOGY~., scales = "free") +
  # xlab("Enrichment\nOverdispersedInHuman<-->OverdispersedInChimp") +
  xlab("Enrichment") +
  scale_y_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) +
  labs(color = "Adjusted P-value") +
  geom_vline(xintercept = 0, linetype="dotted") +
  theme_bw() +
  theme(axis.text.y = element_text(size=6)) + 
  ylab(NULL)
MeanDispersionContrastPlot

I see ontology terms related with things that I think of as housekeeping functions (tRNA processing, ribosome, mitochondria complex, RNA processing) as associated with negative residuals (less dispersion than expected, after regressing out dn/ds effect). I still see more cell type specific terms (angiogenesis, leukocyte migration) associated with higher residuals. This is consistent with dispersion partly being related to cell type heterogeneity (genes with housekeeping functions expressed in all cell types being less dispersed), as an effect independent from the negative selection forces that might keep genetically driven dispersion low.

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] MASS_7.3-51.4          Cairo_1.5-12.2         data.table_1.12.8     
 [4] qvalue_2.16.0          broom_0.5.2            org.Hs.eg.db_3.8.2    
 [7] AnnotationDbi_1.46.1   IRanges_2.18.3         S4Vectors_0.22.1      
[10] Biobase_2.44.0         BiocGenerics_0.30.0    clusterProfiler_3.12.0
[13] knitr_1.26             forcats_0.4.0          stringr_1.4.0         
[16] dplyr_0.8.3            purrr_0.3.3            readr_1.3.1           
[19] tidyr_1.0.0            tibble_2.1.3           ggplot2_3.2.1         
[22] tidyverse_1.3.0       

loaded via a namespace (and not attached):
  [1] fgsea_1.10.1        colorspace_1.4-1    ellipsis_0.3.0     
  [4] ggridges_0.5.1      rprojroot_1.3-2     fs_1.3.1           
  [7] rstudioapi_0.10     farver_2.0.1        urltools_1.7.3     
 [10] graphlayouts_0.5.0  ggrepel_0.8.1       bit64_0.9-7        
 [13] fansi_0.4.0         lubridate_1.7.4     xml2_1.2.2         
 [16] splines_3.6.1       GOSemSim_2.10.0     polyclip_1.10-0    
 [19] zeallot_0.1.0       jsonlite_1.6        workflowr_1.5.0    
 [22] GO.db_3.8.2         dbplyr_1.4.2        ggforce_0.3.1      
 [25] BiocManager_1.30.10 compiler_3.6.1      httr_1.4.1         
 [28] rvcheck_0.1.7       backports_1.1.5     assertthat_0.2.1   
 [31] Matrix_1.2-18       lazyeval_0.2.2      cli_2.0.0          
 [34] later_1.0.0         tweenr_1.0.1        htmltools_0.4.0    
 [37] prettyunits_1.0.2   tools_3.6.1         igraph_1.2.4.2     
 [40] gtable_0.3.0        glue_1.3.1          reshape2_1.4.3     
 [43] DO.db_2.9           fastmatch_1.1-0     Rcpp_1.0.5         
 [46] enrichplot_1.4.0    cellranger_1.1.0    vctrs_0.2.0        
 [49] nlme_3.1-143        ggraph_2.0.0        xfun_0.11          
 [52] rvest_0.3.5         lifecycle_0.1.0     DOSE_3.10.2        
 [55] europepmc_0.3       scales_1.1.0        tidygraph_1.1.2    
 [58] hms_0.5.2           promises_1.1.0      RColorBrewer_1.1-2 
 [61] yaml_2.2.0          memoise_1.1.0       gridExtra_2.3      
 [64] UpSetR_1.4.0        triebeard_0.3.0     stringi_1.4.3      
 [67] RSQLite_2.1.4       highr_0.8           BiocParallel_1.18.1
 [70] rlang_0.4.1         pkgconfig_2.0.3     evaluate_0.14      
 [73] lattice_0.20-38     labeling_0.3        cowplot_1.0.0      
 [76] bit_1.1-14          tidyselect_0.2.5    plyr_1.8.5         
 [79] magrittr_1.5        R6_2.4.1            generics_0.0.2     
 [82] DBI_1.0.0           pillar_1.4.2        haven_2.2.0        
 [85] withr_2.1.2         modelr_0.1.5        crayon_1.3.4       
 [88] rmarkdown_1.18      viridis_0.5.1       progress_1.2.2     
 [91] grid_3.6.1          readxl_1.3.1        blob_1.2.0         
 [94] git2r_0.26.1        reprex_0.3.0        digest_0.6.23      
 [97] httpuv_1.5.2        gridGraphics_0.4-1  munsell_0.5.0      
[100] viridisLite_0.3.0   ggplotify_0.0.4

ResponseToReviewer_Point6

Overview

Analysis