Last updated: 2020-10-25

Checks: 6 passed, 1 warning

Knit directory: Comparative_eQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.5.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20190319) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
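
As a small illustration of why the seed matters, the sketch below reseeds with the same value and checks that a random subsample is reproduced exactly. The object names `first_draw` and `second_draw` are hypothetical and do not come from the analysis.

```r
# Seed, draw a random subsample, then reseed and draw again.
set.seed(20190319)
first_draw <- sample(1:100, 5)

set.seed(20190319)
second_draw <- sample(1:100, 5)

# The two draws are identical because the generator started from the same state.
identical(first_draw, second_draw)
```

Without the second `set.seed()` call, the second draw would continue from wherever the generator left off and would generally differ from the first.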

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    WorkingManuscript.zip
    Ignored:    WorkingManuscript/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    analysis_temp/.DS_Store
    Ignored:    big_data/
    Ignored:    code/.DS_Store
    Ignored:    code/snakemake_workflow/.DS_Store
    Ignored:    code/snakemake_workflow/.Rhistory
    Ignored:    data/.DS_Store
    Ignored:    data/PastAnalysesDataToKeep/.DS_Store
    Ignored:    figures/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  analysis/.index.Rmd.swp
    Untracked:  analysis/20200907_Response_Point_02.Rmd
    Untracked:  analysis/20200907_Response_Point_04.Rmd
    Untracked:  analysis/20200907_Response_Point_07.Rmd
    Untracked:  analysis/20200907_Response_Point_08.Rmd
    Untracked:  analysis/20200907_Response_Point_11_2.Rmd
    Untracked:  analysis/20201012_MakeSourceData.Rmd
    Untracked:  data/FinalSourceData/
    Untracked:  data/c5.all.v7.1.symbols.gmt
    Untracked:  data/c5.all.v7.1.symbols.gmt.categories.tsv.gz
    Untracked:  data/h.all.v7.1.symbols.gmt

Unstaged changes:
    Modified:   README.md
    Modified:   analysis/20200907_Response_OriginalComments.Rmd
    Modified:   analysis/20200907_Response_Point_06.Rmd
    Modified:   analysis/20200907_Response_Point_09-2.Rmd
    Modified:   analysis/20200907_Response_Point_09.Rmd
    Modified:   analysis/20200907_Response_Point_11.Rmd
    Modified:   analysis/Final_2_DispersionPlots.Rmd
    Modified:   analysis/about.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis_temp/TabulaMuris_analysis2.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File  Version  Author         Date        Message
Rmd   5f95cbc  Benjmain Fair  2020-09-10  update site
html  5f95cbc  Benjmain Fair  2020-09-10  update site
Rmd   16b2904  Benjmain Fair  2020-09-10  update site
Rmd   0e88516  Benjmain Fair  2020-09-10  update site, address reviewers
html  0e88516  Benjmain Fair  2020-09-10  update site, address reviewers
Rmd   e3ed68f  Benjmain Fair  2020-09-09  update site, address reviewers
html  e3ed68f  Benjmain Fair  2020-09-09  update site, address reviewers

Description:

The reviews are in. I will address each point one by one, but on this site I will only show work for the points for which I did additional analyses. Each of those points has its own link to my new analysis. For more information than what is shown in these R scripts, eLife makes reviewer comments and our full response public upon publication.

Reviewer Comments:

SUMMARY:

This is a solid study, with a large sample size, identifying quantitative trait loci (eQTLs) in humans and chimpanzees, using gene expression data from primary heart samples. The authors complemented the analysis of gene expression with a comparative eQTL mapping, as opposed to relying on mean expression levels, as most comparative studies like this one do. Also unlike many studies focused on mapping associations between genetic and gene regulatory variation, the authors paid attention to the group dispersion/variance of gene expression among samples as well as the evolutionary processes that shape the differences in gene regulation between individuals. The calculation of power for discovering differentially expressed genes as a function of sample size at the beginning of the paper is a thoughtful analysis that is useful to many in the community. All of the analyses are extremely thorough and well-executed. The statistical tests are appropriate and rigorous. Results are interpreted in a conservative fashion.

The main limitation is that the authors are not able to conclusively disambiguate between different causes of dispersion. Genetics, cell type, and technical variation may all contribute to dispersion. The authors state this very clearly throughout the manuscript. In part, this may reflect the authors’ underselling their results somewhat. But in part, this really does reflect reality: Cell type is a major confounder that may provide false signals in other analyses.

REVISIONS FOR THIS PAPER:

The reviewers suggested a number of potential additions to clarify current results or build upon them. I will leave it up to the authors to decide which are worth including in their revision.

  1. The first test authors conducted is to identify differentially variable (DV) genes. A total of 2658 DV genes were identified. The problem of the result is that almost equal number of up- and down-regulated DV genes symmetrically distributed around DV=0. Often, this is an indication of a lack of biological signals in data analysis. This might be due to the pooling of gene groups with diverse functionality together. Therefore, this reviewer suggests that authors should break down genes into subgroups to detail the up and down-regulatory patterns with the hope that some of the gene groups give interpretable results
  • We will offer an explanation that will not require additional analyses.
  2. The second test is to correlate the higher coding sequence conservation with lower dispersion. Again, the positive result is not unexpected. There are many indirect and/or confounding factors that may explain the effect. This reviewer, however, understands it is impossible to control them all (also authors have attempted to address some of them in the next few tests). However, here it is better to add exploratory analyses for genes in different functional groups and also give examples of outlier genes that do not follow the rule.
  • Here I address this point by correlating dn/ds with dispersion within each GO category.
  3. The third test is to examine the correlation between gene expression variability with single-cell type heterogeneity of samples. Authors first used Tabula Muris dataset to show dispersion is strongly correlated with cell-type specificity/diversity. If this is true, then the point that authors really wanted to demonstrate is, in fact, hampered. Authors might really want to show the “true” single-cell variability (see, for example, PMID: 31861624) is correlated with the level of group variance of gene expression.
  • See point 11.
  4. The fourth test authors conducted is to show that dn/ds and pn/ps ratios of genes are correlated with gene expression variability (variance). However, because of the existence of heterogeneity of cell-type composition in samples, any correlation observed may be utterly biased by this single uncontrollable confounding factor. Furthermore, heart tissues contain an over-abundant expression of genes encoded in the mitochondrial genome. The expression level of these mt-genes may vary substantially between samples and reflect the health status of primary sample donors. PEER normalization may have to take this into account as a covariant.
  • Here I address this point by showing that including MT genes does not change how much of the total variance is explained by the first 10 PCs, which (rather than PEER factors) were included in the eQTL model.
  5. Several other tests authors performed are around eQTLs (eGene overlap and eSNP overlap) between the two species. These are typical tests evolutionary biologists usually try to do whenever data is available. However, the issues with these types of tests are the low power in general. More importantly, in order to be consistent with previous tests which are all around the explanation of gene expression variance, this part should address the overlap between expression vQTLs in humans and chimps.
  • We will offer an explanation that will not require additional analyses.
  6. I would like to see more discussion about the inter-relatedness of the chimpanzees in the analysis of gene expression. Is that contributing to the power of the DE analysis, which has really high numbers of DE genes. That may certainly be due to the large samples size, but should be addressed. Related to that, the support that the gene-wise dispersion estimates are well correlated in humans and chimpanzees overall (Fig1C, and S4) seems qualitative. It looks like the chimpanzees might have less dispersion overall?
  • Here I address this point about relatedness empirically by comparing DE analyses using subsets of related and unrelated chimps.
  7. What do the authors think these findings mean for study systems outside of humans and captive chimpanzees? Both on the technical level (e.g. sample size), and for how their approach could be helpful outside of these species. Generalizing this approach would broaden the impact and audience of the paper.
  • Here I measure which GTEx tissue types have higher overdispersion (more heterogeneous from sample to sample, with less power for eQTL mapping and presumably for DE analysis).
  8. Did the authors test directly whether eQTLs were enriched in genes with a high dispersion? I could not find this going back through the paper. This seems almost trivially likely to be true. I may have missed this result? Or did the authors worry this is too likely to be confounded with cell type? Either way, this seems like a result that may be useful to show even if the authors did acknowledge that it was likely to be confounded.
  • Here I show that eGenes do on average have higher dispersion.
  9. Did the authors consider looking for cell-type QTLs? They state several times in the paper the possibility that genetic factors may influence cell types. They have enough data - at least in human - to obtain QTLs for specific cell types, as others have done (Marderstein et. al. Nat Comms 2020; Donovan et. al. Nat Comms 2020). If these cell type QTLs were enriched near genes with a high dispersion, this may bolster the author’s argument that genetic factors underlie dispersion by affecting cell type composition.
  • Here I follow this suggestion, describing how I quantified cell type composition phenotypes in preparation for identifying cell type QTLs.
  • Here I show the results of that cell type QTL association analysis.
  10. The scRNA-seq reference used for estimating cell types in heart tissue was derived from mice. Could this lead the authors to underestimate the degree to which cell types drive dispersion in genes that are variable between human and chimp? Genes that are variable between human/ chimp may also be more likely to be variable between either species and mouse, and perhaps this variability has led to them becoming more/ less of a marker of a specific cell population (and hence their dispersion in primates does not correlate with cell type specificity in mouse).
  • Agreed. I did not provide any additional analysis for this.
  11. Have the authors tried estimating dispersion on top of what is expected based on differences in cell type? There are several strategies that might work for this: There are new strategies for estimating a posterior of cell type specific expression from a bulk sample, conditional on scRNA-seq data as prior information (Chu and Danko, bioRxiv, 2020). These cell type specific expression estimates could then be analyzed for dispersion. Alternatively, it may also work to regress the estimated proportion of each cell type out of the dispersion estimates. While there are certainly a lot of pitfalls with using these strategies, especially in the setting shown here (all of this would work better if there were species matched reference data), they might provide an avenue for depleting the contribution of cell type differences from dispersion estimates.
  • Here I provide an analysis using TED to estimate cell type specific expression and estimate dispersion.
  • Here I use those cell type specific expression and dispersion estimates and present the results, including a correlation matrix of dispersion estimates across cell types and species. Note that while the correlation matrix is based on point estimates of dispersion, I used a bootstrapping approach to understand how stable these estimates are (see the methods section of the paper). The point estimates and standard errors of these dispersion estimates will be available as a source data file for the related figure.
  12. Can the authors add a dotted line to show the shape of the distribution for genes with low dispersion, or where dispersion is shared in both human and chimpanzee, in figure 4b? Is this different from genes that are dispersed in either chimp or human?
  • I addressed this by modifying the original R Markdown entry that made figure 4B to include a scatterplot showing the distribution as requested.
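
The response to the second point above mentions correlating dn/ds with dispersion on a GO category basis. The sketch below shows one plausible shape for such a per-category analysis. It is not the code from the linked analysis: the data are simulated, and the column names (`category`, `dnds`, `dispersion`), the category labels, and the choice of Spearman correlation with Benjamini-Hochberg adjustment are all assumptions made for illustration.

```r
# Simulated stand-in data; every name and value here is hypothetical.
set.seed(1)
n_genes <- 300
genes <- data.frame(
  gene     = paste0("gene", seq_len(n_genes)),
  category = sample(c("GO:A", "GO:B", "GO:C"), n_genes, replace = TRUE),
  dnds     = rexp(n_genes, rate = 5),
  stringsAsFactors = FALSE
)
# Make dispersion loosely increase with dn/ds, plus noise.
genes$dispersion <- 0.5 * genes$dnds + rnorm(n_genes, sd = 0.2)

# Spearman correlation between dn/ds and dispersion within each category.
per_category <- do.call(rbind, lapply(split(genes, genes$category), function(d) {
  test <- cor.test(d$dnds, d$dispersion, method = "spearman", exact = FALSE)
  data.frame(
    category = d$category[1],
    n_genes  = nrow(d),
    rho      = unname(test$estimate),
    p_value  = test$p.value,
    stringsAsFactors = FALSE
  )
}))

# Adjust p-values for testing several categories at once.
per_category$q_value <- p.adjust(per_category$p_value, method = "BH")
print(per_category)
```

Splitting by category before testing keeps each correlation restricted to functionally related genes, which is the kind of breakdown the reviewer asked for. With real GO annotations a gene can belong to many categories, so the categories would overlap and the tests would not be independent.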

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.5.0 Rcpp_1.0.5      rprojroot_1.3-2 digest_0.6.23  
 [5] later_1.0.0     R6_2.4.1        backports_1.1.5 git2r_0.26.1   
 [9] magrittr_1.5    evaluate_0.14   stringi_1.4.3   rlang_0.4.7    
[13] fs_1.3.1        promises_1.1.0  whisker_0.4     rmarkdown_1.18 
[17] tools_3.6.1     stringr_1.4.0   glue_1.4.2      httpuv_1.5.2   
[21] xfun_0.11       yaml_2.2.0      compiler_3.6.1  htmltools_0.4.0
[25] knitr_1.26