ResponseToReviewer

Last updated: 2020-09-23

Checks: 6 1

Knit directory: Comparative_eQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.5.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190319)

The command set.seed(20190319) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 52f85bb

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    WorkingManuscript.zip
    Ignored:    WorkingManuscript/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    analysis_temp/.DS_Store
    Ignored:    big_data/
    Ignored:    code/.DS_Store
    Ignored:    code/snakemake_workflow/.DS_Store
    Ignored:    code/snakemake_workflow/.Rhistory
    Ignored:    data/.DS_Store
    Ignored:    data/PastAnalysesDataToKeep/.DS_Store
    Ignored:    figures/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  analysis/20200907_Response_Point_02.Rmd
    Untracked:  analysis/20200907_Response_Point_04.Rmd
    Untracked:  data/c5.all.v7.1.symbols.gmt
    Untracked:  data/c5.all.v7.1.symbols.gmt.categories.tsv.gz
    Untracked:  data/h.all.v7.1.symbols.gmt

Unstaged changes:
    Modified:   analysis/20200907_Response_OriginalComments.Rmd
    Modified:   analysis/20200907_Response_Point_06.Rmd
    Modified:   analysis/20200907_Response_Point_09-2.Rmd
    Modified:   analysis/20200907_Response_Point_09.Rmd
    Modified:   analysis/20200907_Response_Point_11.Rmd
    Modified:   analysis/Final_2_DispersionPlots.Rmd
    Modified:   analysis_temp/TabulaMuris_analysis2.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

Original reviewer point:

The fourth test authors conducted is to show that dn/ds and pn/ps ratios of genes are correlated with gene expression variability (variance). However, because of the existence of heterogeneity of cell-type composition in samples, any correlation observed may be utterly biased by this single uncontrollable confounding factor. Furthermore, heart tissues contain an over-abundant expression of genes encoded in the mitochondrial genome. The expression level of these mt-genes may vary substantially between samples and reflect the health status of primary sample donors. PEER normalization may have to take this into account as a covariant.

dn/ds and pn/ps are dna based measurements and should be totally orthogonal to gene expression variability, cell type heterogeneity between samples and such.
To address the reviewers second point, I will here investigate the extent to which PCA (since I use PCs as covariates in eQTL calling instead of PEER factors) is affected by the inclusion of mitochondrial genes which I originally excluded from analysis. To do this, I will compare the PCA results of the gene expression matrix with imputed MT gene expression values (imputed to the sample-wide median for each gene), and compare it to the results if I use actual expression values.

First load necessary libraries for analysis…

library(tidyverse)
library(data.table)
library(knitr)
library(edgeR)
library(gplots)
source("../code/CustomFunctions.R")

Now load the some input files…

First the count table as used in the original manusript…

#Get expression values for the 
CountTableChimpFile <- '../output/PowerAnalysisFullCountTable.Chimp.subread.txt.gz'
CountTableHumanFile <- '../output/PowerAnalysisFullCountTable.Human.subread.txt.gz'
OutputDE <- '../output/Final/TableS2.tab'
DropFileName <- '../data/DE_SamplesToDrop.txt'

DropFile <- read.delim(DropFileName, sep='\t', col.names = c("Sample", "Species"), stringsAsFactors = F)
HumanSamplesToDrop <- DropFile %>% filter(Species=="Human") %>% pull(Sample)
ChimpSamplesToDrop <- DropFile %>% filter(Species=="Chimp") %>% pull(Sample)
DE.results <- read.delim(OutputDE, sep='\t', stringsAsFactors = F)
GeneListForOverdispersionCalculation <- DE.results$Ensembl_geneID

CountTables <- GetCountTables(CountTableChimpFile,
                              CountTableHumanFile,
                              0, GeneListForOverdispersionCalculation, ChimpSampleDrop=ChimpSamplesToDrop, HumanSampleDrop = HumanSamplesToDrop)

Now a fuller count table so we can find expression values from MT genes.

FullTable <- fread("../output/STAR.RawCountTable.txt.gz")

FullTable <- FullTable %>%
  separate(2, into=paste0("C.", strsplit(colnames(FullTable)[2], ' ')[[1]])) %>%
  dplyr::select(-2) %>%
  mutate_at(-1, as.integer)

GeneNames <- read.delim("../data/Biomart_export.Hsap.Ptro.orthologs.txt.gz")


MT.Genes <- GeneNames %>%
  filter(str_detect(Chimpanzee.gene.name,pattern="^MT-")) %>%
  dplyr::select(Chimpanzee.gene.name, Chimpanzee.gene.stable.ID)

#Count table of MT genes
MT.CountTable <- FullTable %>%
  filter(GENE %in% MT.Genes$Chimpanzee.gene.stable.ID) %>%
  column_to_rownames("GENE")
MT.CountTable %>% kable()

	C.4X0212	C.4X0267	C.4X0333	C.4X0339	C.4X0354	C.4X0357	C.4X0550	C.4x0025	C.4x0043	C.4x373	C.4x0430	C.4x0519	C.4x523	C.88A020	C.95A014	C.295	C.317	C.338	C.389	C.438	C.456	C.462	C.476	C.495	C.503	C.522	C.529	C.537	C.549	C.554	C.554_2	C.558	C.570	C.623	C.676	C.724	C.Little_R	C.MD_And
ENSPTRG00000042641	1225139	1395362	626059	1817795	1314663	1487905	1414239	724262	1117381	844931	1069735	1131150	907877	1386558	1119113	2671514	945443	1068457	1948692	1630818	1345994	1853598	2578236	2049414	1903301	1609196	1905046	2436447	2232543	923291	989955	1613475	872765	1255842	1174224	1322953	1635983	1313148
ENSPTRG00000042626	595124	1184959	752864	1346396	1330544	999018	1204887	491613	920392	929378	658610	1006395	829517	1565618	472789	2922113	644442	717594	1644331	834599	790593	1191411	1430169	2115597	1230627	1670707	1786107	2345625	1314135	629247	540243	1507712	478228	1081295	531279	871002	1188400	901150
ENSPTRG00000042642	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
ENSPTRG00000042657	10919560	11207253	7621373	18913511	9783877	11991378	10591085	5063262	8021281	3180011	7576705	9036370	6778170	11438819	8809947	18821780	7776345	11662175	15657083	14310164	12434341	10838457	24634805	9917607	18374093	12572345	15123590	20209687	13005903	11287173	10533101	14801818	6833776	10489326	17177721	11693458	11372879	8920509
ENSPTRG00000042660	2281051	2217094	2471250	4976683	2925848	3028701	2507750	1875989	2243275	1257402	1780727	1638230	2535655	2428793	1890506	5237640	1499587	2337969	2993579	2563361	3326678	2441279	5198371	3013554	3532133	2980545	3154789	6547681	2696388	1829700	1572366	3265725	1615535	2169811	2283633	2243143	3440372	2732462
ENSPTRG00000042653	120	69	58	803	73	65	60	43	49	84918	54	112447	76	110	47	907	70	61	133	101	274	101	246	963	358	104	306149	331	120	77	95	148	68	558	91	150	95	49
ENSPTRG00000042650	1948	894410	1856	12672	1257519	1386395	1104507	521	869441	936150	709	1015588	889	1114737	855	2116107	1262	1223	1313614	971	1571	900153	6123	884452	2803	1190496	2244503	5392	1198	2675	2198	3999	747	780231	1890	2579	2157	978152
ENSPTRG00000042661	1797	491013	1505	4202	211135	663966	478544	456	471689	470307	557	589501	707	573991	690	1060849	928	1185	679592	941	1287	381282	3715	389361	1704	695032	1126484	3957	935	1519	7551	2634	627	446728	1182	2421	1227	524811
ENSPTRG00000042628	399168	268875	300273	647370	394313	268190	335091	205321	198786	223935	294631	219204	311540	301228	253407	598135	306587	438166	331364	427547	375668	287757	713317	368139	505874	401212	278580	1052447	509227	268682	329197	487375	210474	262422	298908	290496	495081	312884
ENSPTRG00000042631	390948	176607	357965	762341	179459	362027	179424	133679	219514	200934	255022	315252	275124	476205	287551	958120	400544	245808	545295	341233	345938	146246	757070	294020	658181	383394	705494	1003542	582681	417394	397242	730114	219884	374371	364539	417088	628863	264270
ENSPTRG00000042639	1392110	1965775	1284030	3176466	1307718	2642693	1817170	665078	1337400	1166799	1107350	1745780	1164278	2092694	1199804	4856059	1449623	1011518	2420903	1544513	1628723	1351917	2777542	1999679	2330633	2281329	3190582	3301221	2306915	1439562	1527431	2274414	971233	1888972	1698314	1282767	2065563	1820314
ENSPTRG00000042651	441717	783666	241447	394041	360707	420344	776491	94651	440235	247369	225283	755897	197704	1021927	463430	1252668	204724	303717	1059799	388412	454757	372984	925955	453910	566058	750072	857840	690980	433615	183957	260048	418771	161251	630265	253936	569209	341702	620303
ENSPTRG00000042630	287	155145	152	269	55666	95115	184778	55	119504	29564	79	138895	45	200023	2708	133743	98	165	177571	94	193	50630	952	41034	186	125266	83507	360	113	172	154	194	50	89916	125	305	124	183474
ENSPTRG00000042625	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
ENSPTRG00000042637	38464	991272	24835	70907	96148	1334270	1061224	198674	734818	652169	27719	680740	33438	1073477	40952	2091075	19850	45846	1166749	51464	311207	128121	600603	105627	54572	861677	1060310	83196	67202	15619	43277	50951	27461	716071	31770	42275	47629	949398

#Add MT genes to count table, and convert to logCPM
Log.Cpm <- bind_rows(
  FullTable %>%
  filter(GENE %in% MT.Genes$Chimpanzee.gene.stable.ID),
  CountTables$Chimp$Counts %>% rownames_to_column("GENE")
) %>%
  column_to_rownames("GENE") %>%
  dplyr::select(-C.4X0095) %>% 
  DGEList() %>%
  calcNormFactors() %>% cpm(log=T) %>% as.data.frame()

#Get median expression of all genes
Median.Log.Cpm <- apply(Log.Cpm, 1, median)

#Plot median expression of all genes, and MT genes
data.frame(Med=Median.Log.Cpm) %>%
  rownames_to_column("GENE") %>%
  mutate(Is.MT.gene = GENE %in% MT.Genes$Chimpanzee.gene.stable.ID) %>%
  ggplot(aes(x=Med, color=Is.MT.gene)) +
  geom_histogram() +
  facet_wrap(~Is.MT.gene, scales="free_y")

#Make "imputed" count table of MT genes based on median expression of those genes across all samples
MT.Med.Log.Cpm.df <- data.frame(Med=Median.Log.Cpm) %>%
  rownames_to_column("GENE") %>%
  filter(GENE %in% MT.Genes$Chimpanzee.gene.stable.ID) %>%
  pull(Med) %>%
  matrix(nrow = nrow(MT.CountTable),ncol = ncol(MT.CountTable)) %>%
  as.data.frame(row.names = rownames(MT.CountTable))
colnames(MT.Med.Log.Cpm.df) <- colnames(Log.Cpm)

#Make count table with the MT expression from the imputed values
Log.Cpm.MT.Genes.Imputed.To.Med <- Log.Cpm %>%
  as.data.frame() %>%
  rownames_to_column("GENE") %>%
  filter(!GENE %in% MT.Genes$Chimpanzee.gene.stable.ID) %>%
  bind_rows(MT.Med.Log.Cpm.df %>% rownames_to_column("GENE")) %>%
  column_to_rownames("GENE")

Log.Cpm.MT.Genes.Imputed.To.Med <- bind_rows(
  MT.Med.Log.Cpm.df %>% rownames_to_column("GENE"),
  Log.Cpm %>%
  as.data.frame() %>%
  rownames_to_column("GENE") %>%
  filter(!GENE %in% MT.Genes$Chimpanzee.gene.stable.ID)
) %>%
  column_to_rownames("GENE")
  
qplot(Log.Cpm.MT.Genes.Imputed.To.Med$C.4X0212, Log.Cpm$C.4X0212)

#PCA for both count tables
PCA.real <- Log.Cpm %>% t() %>% prcomp()
PCA.imputed <- Log.Cpm.MT.Genes.Imputed.To.Med %>% t() %>% prcomp()


summary(PCA.real)

Importance of components:
                           PC1      PC2      PC3      PC4      PC5     PC6
Standard deviation     48.5612 27.36447 24.50734 21.36360 19.60195 18.9923
Proportion of Variance  0.2863  0.09092  0.07292  0.05542  0.04665  0.0438
Cumulative Proportion   0.2863  0.37724  0.45017  0.50558  0.55223  0.5960
                            PC7      PC8      PC9     PC10     PC11    PC12
Standard deviation     17.04537 15.62519 15.40365 15.21628 13.94564 13.3682
Proportion of Variance  0.03528  0.02964  0.02881  0.02811  0.02361  0.0217
Cumulative Proportion   0.63131  0.66095  0.68976  0.71787  0.74149  0.7632
                           PC13     PC14     PC15     PC16     PC17     PC18
Standard deviation     12.70993 12.04979 11.59453 11.16545 10.74883 10.52678
Proportion of Variance  0.01961  0.01763  0.01632  0.01514  0.01403  0.01345
Cumulative Proportion   0.78280  0.80043  0.81675  0.83189  0.84591  0.85937
                           PC19    PC20    PC21    PC22    PC23    PC24    PC25
Standard deviation     10.17483 9.52356 9.18588 8.96669 8.72231 8.51636 8.25668
Proportion of Variance  0.01257 0.01101 0.01025 0.00976 0.00924 0.00881 0.00828
Cumulative Proportion   0.87194 0.88295 0.89320 0.90296 0.91220 0.92100 0.92928
                          PC26    PC27    PC28    PC29    PC30    PC31    PC32
Standard deviation     8.17220 7.83975 7.70779 7.41979 7.09841 6.90808 6.68419
Proportion of Variance 0.00811 0.00746 0.00721 0.00668 0.00612 0.00579 0.00542
Cumulative Proportion  0.93739 0.94485 0.95206 0.95875 0.96487 0.97066 0.97609
                          PC33    PC34    PC35    PC36    PC37      PC38
Standard deviation     6.65103 6.39709 6.26032 6.20019 5.84560 4.751e-14
Proportion of Variance 0.00537 0.00497 0.00476 0.00467 0.00415 0.000e+00
Cumulative Proportion  0.98146 0.98642 0.99118 0.99585 1.00000 1.000e+00

summary(PCA.imputed)

Importance of components:
                          PC1      PC2      PC3      PC4      PC5      PC6
Standard deviation     48.536 27.35180 24.28489 21.31422 19.57805 18.88980
Proportion of Variance  0.289  0.09178  0.07235  0.05573  0.04702  0.04378
Cumulative Proportion   0.289  0.38079  0.45314  0.50888  0.55590  0.59968
                            PC7      PC8      PC9     PC10     PC11     PC12
Standard deviation     17.01909 15.60581 15.20284 15.07386 13.85255 13.34940
Proportion of Variance  0.03554  0.02988  0.02836  0.02788  0.02354  0.02186
Cumulative Proportion   0.63521  0.66509  0.69345  0.72132  0.74487  0.76673
                           PC13     PC14     PC15    PC16     PC17     PC18
Standard deviation     12.68231 11.96573 11.47176 11.0944 10.61526 10.29805
Proportion of Variance  0.01973  0.01757  0.01615  0.0151  0.01382  0.01301
Cumulative Proportion   0.78646  0.80403  0.82017  0.8353  0.84910  0.86211
                          PC19    PC20    PC21    PC22   PC23    PC24    PC25
Standard deviation     9.70311 9.43585 8.96692 8.79606 8.7052 8.44992 8.24773
Proportion of Variance 0.01155 0.01092 0.00986 0.00949 0.0093 0.00876 0.00835
Cumulative Proportion  0.87366 0.88458 0.89445 0.90394 0.9132 0.92200 0.93034
                          PC26    PC27   PC28    PC29    PC30    PC31    PC32
Standard deviation     7.99613 7.68151 7.6064 7.21507 7.08926 6.84747 6.67435
Proportion of Variance 0.00784 0.00724 0.0071 0.00639 0.00617 0.00575 0.00547
Cumulative Proportion  0.93819 0.94542 0.9525 0.95891 0.96507 0.97083 0.97629
                          PC33    PC34    PC35    PC36    PC37      PC38
Standard deviation     6.55250 6.34532 6.23390 6.13808 5.78836 4.589e-14
Proportion of Variance 0.00527 0.00494 0.00477 0.00462 0.00411 0.000e+00
Cumulative Proportion  0.98156 0.98650 0.99127 0.99589 1.00000 1.000e+00

So the fraction of total variance explained by the first 10 PCs in both cases in virtually the same. I will use this to argue that the PC covariates we have already used are similarly effective at capturing cellular state. Let’s make a screeplot to show this visually.

bind_rows(
  summary(PCA.real)$importance %>% t() %>% data.frame() %>% mutate(GeneExpressionMatrix="With MT genes") %>% rownames_to_column("PC"),
  summary(PCA.imputed)$importance %>% t() %>% data.frame() %>% mutate(GeneExpressionMatrix="Without MT genes") %>% rownames_to_column("PC")) %>% 
  mutate(PC=as.numeric(PC)) %>%
  ggplot(aes(x=PC, y=Cumulative.Proportion, color=GeneExpressionMatrix)) +
  geom_line() +
  ylab("Cumulative fraction\nvariance explained") +
  scale_x_continuous(breaks=1:13, limits = c(1,13)) +
  scale_color_discrete(guide = guide_legend(title.position = "top", nrow = 2)) +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), legend.position = "bottom")

And write out the plot…

ggsave("../figures/OriginalArt/ResponseToReviewers.MT.PCA.pdf", height=3, width=3)

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_1.0.0     gridExtra_2.3     MASS_7.3-51.4     gplots_3.0.1.1   
 [5] edgeR_3.26.8      limma_3.40.6      knitr_1.26        data.table_1.12.8
 [9] forcats_0.4.0     stringr_1.4.0     dplyr_0.8.3       purrr_0.3.3      
[13] readr_1.3.1       tidyr_1.0.0       tibble_2.1.3      ggplot2_3.2.1    
[17] tidyverse_1.3.0  

loaded via a namespace (and not attached):
 [1] httr_1.4.1         jsonlite_1.6       R.utils_2.9.0      modelr_0.1.5      
 [5] gtools_3.8.1       assertthat_0.2.1   highr_0.8          cellranger_1.1.0  
 [9] yaml_2.2.0         pillar_1.4.2       backports_1.1.5    lattice_0.20-38   
[13] glue_1.3.1         digest_0.6.23      promises_1.1.0     rvest_0.3.5       
[17] colorspace_1.4-1   htmltools_0.4.0    httpuv_1.5.2       R.oo_1.23.0       
[21] pkgconfig_2.0.3    broom_0.5.2        haven_2.2.0        scales_1.1.0      
[25] gdata_2.18.0       later_1.0.0        git2r_0.26.1       generics_0.0.2    
[29] farver_2.0.1       ellipsis_0.3.0     withr_2.1.2        lazyeval_0.2.2    
[33] cli_2.0.0          magrittr_1.5       crayon_1.3.4       readxl_1.3.1      
[37] evaluate_0.14      R.methodsS3_1.7.1  fs_1.3.1           fansi_0.4.0       
[41] nlme_3.1-143       xml2_1.2.2         tools_3.6.1        hms_0.5.2         
[45] lifecycle_0.1.0    munsell_0.5.0      reprex_0.3.0       locfit_1.5-9.1    
[49] compiler_3.6.1     caTools_1.17.1.3   rlang_0.4.1        grid_3.6.1        
[53] rstudioapi_0.10    bitops_1.0-6       labeling_0.3       rmarkdown_1.18    
[57] gtable_0.3.0       DBI_1.0.0          R6_2.4.1           lubridate_1.7.4   
[61] zeallot_0.1.0      workflowr_1.5.0    rprojroot_1.3-2    KernSmooth_2.23-16
[65] stringi_1.4.3      Rcpp_1.0.5         vctrs_0.2.0        dbplyr_1.4.2      
[69] tidyselect_0.2.5   xfun_0.11

ResponseToReviewer_Point6