Last updated: 2020-09-23

Checks: 6 1

Knit directory: Comparative_eQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.5.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20190319) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    WorkingManuscript.zip
    Ignored:    WorkingManuscript/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    analysis_temp/.DS_Store
    Ignored:    big_data/
    Ignored:    code/.DS_Store
    Ignored:    code/snakemake_workflow/.DS_Store
    Ignored:    code/snakemake_workflow/.Rhistory
    Ignored:    data/.DS_Store
    Ignored:    data/PastAnalysesDataToKeep/.DS_Store
    Ignored:    figures/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  analysis/20200907_Response_Point_02.Rmd
    Untracked:  analysis/20200907_Response_Point_04.Rmd
    Untracked:  data/c5.all.v7.1.symbols.gmt
    Untracked:  data/c5.all.v7.1.symbols.gmt.categories.tsv.gz
    Untracked:  data/h.all.v7.1.symbols.gmt

Unstaged changes:
    Modified:   analysis/20200907_Response_OriginalComments.Rmd
    Modified:   analysis/20200907_Response_Point_06.Rmd
    Modified:   analysis/20200907_Response_Point_09-2.Rmd
    Modified:   analysis/20200907_Response_Point_09.Rmd
    Modified:   analysis/20200907_Response_Point_11.Rmd
    Modified:   analysis/Final_2_DispersionPlots.Rmd
    Modified:   analysis_temp/TabulaMuris_analysis2.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 25874ad Benjmain Fair 2020-09-10 update site
html 25874ad Benjmain Fair 2020-09-10 update site
Rmd 5f95cbc Benjmain Fair 2020-09-10 update site
html 5f95cbc Benjmain Fair 2020-09-10 update site

Original reviewer point:

I would like to see more discussion about the inter-relatedness of the chimpanzees in the analysis of gene expression. Is that contributing to the power of the DE analysis, which has really high numbers of DE genes. That may certainly be due to the large samples size, but should be addressed. Related to that, the support that the gene-wise dispersion estimates are well correlated in humans and chimpanzees overall (Fig1C, and S4) seems qualitative. It looks like the chimpanzees might have less dispersion overall?

I will address the first point (about relatedness in DE analysis) two ways. Firstly, I will use variance partition to argue that the relatedness of some of the chimps probably has little impact on the interspecies DE analysis, given that a relatively much larger fraction of variance will probably be explained species, batch, etc. Then I will empricially address this by reperforming DE analysis with subsamples of chimps that have some relatively high degree of inter-relatedness, compared to subsamples that do not have such relatedness.

First load necessary libraries for analysis…

library(tidyverse)
library(gplots)
library(readxl)
library(scales)
library(variancePartition)
library('limma')
library('edgeR')
library(pbapply)
library(knitr)
library(readxl)

source("../code/CustomFunctions.R")

Load relatedness matrix and RNA-seq sample batch info

#Relatedness matrix
SampleLabels <- read.table('../output/ForAssociationTesting.temp.fam', stringsAsFactors = F)$V2
GemmaMatrix <- as.matrix(read.table('../output/GRM.cXX.txt'))
colnames(GemmaMatrix) <- SampleLabels
row.names(GemmaMatrix) <- SampleLabels

#Metadata like RNA-seq batch
Metadata<-read_excel("../data/Metadata_SequencedChimps.xlsx")
Metadata$RNA.Extract_date %>% unique() %>% length()
[1] 5
Metadata$RNA.Library.prep.batch %>% unique() %>% length()
[1] 5
Colors <- data.frame(Numbers=1:5, Colors=hue_pal()(5))

BatchColor <- data.frame(IndividualID=as.character(colnames(GemmaMatrix))) %>% 
  left_join(Metadata, by=c("IndividualID"="IndividualID (as listed in vcf)")) %>%
  dplyr::select(IndividualID, RNA.Library.prep.batch) %>%
  left_join(Colors, by=c("RNA.Library.prep.batch"="Numbers")) %>%
  pull(Colors) %>% as.character()

diag(GemmaMatrix) <- NA #For plotting purposes, just don't show the diagonol so that the color scale is better for non-diagnol entries

hclustfunc <- function(x) hclust(x, method="complete")
distfunc <- function(x) dist(x,method="euclidean")
d <- distfunc(GemmaMatrix)
fit <- hclustfunc(d)
groups <- cutree(fit, k=7) 

NumbersToColors <- function(NumberVector){
  N <- length(unique(NumberVector))
  Key <- setNames(hue_pal()(N), 1:N)
  return(list(ColorVector=recode(NumberVector, !!!Key), Key=Key))
}

ClusterGroupColors <- NumbersToColors(groups)

heatmap.2(GemmaMatrix, trace="none", RowSideColors = ClusterGroupColors$ColorVector, ColSideColors = BatchColor)

Version Author Date
25874ad Benjmain Fair 2020-09-10
5f95cbc Benjmain Fair 2020-09-10

Column colors are RNA library prep batch, row colors are cluster groups (cutree function). I may pick out some of those somewhat related sample blocks for DE analysis…

Groups.df<-as.data.frame(groups) %>% rownames_to_column("Ind")

GemmaMatrix %>% as.data.frame() %>%
  rownames_to_column("IndA") %>%
  gather(key="IndB", value="Kinship", -IndA) %>%
  left_join(Groups.df, by=c("IndA"="Ind")) %>%
  left_join(Groups.df, by=c("IndB"="Ind"), suffix=c(".A.group", ".B.group")) %>%
  filter(groups.A.group==groups.B.group & !IndA==IndB) %>%
  mutate(uniqueID = paste0(pmax(IndA, IndB), pmin(IndA, IndB))) %>%
  distinct(uniqueID, .keep_all=T) %>%
  mutate(groups.A.group=factor(groups.A.group)) %>%
  ggplot(aes(x=groups.A.group, y=Kinship, color=groups.A.group)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(position=position_jitter(0.2)) +
  ylab("Pairwise kinship coefficients within cluster") +
  xlab("Cluster") +
  scale_colour_manual(name = "ClusterGroup",values = ClusterGroupColors$Key) +
  theme_bw() +
  theme(legend.position = "none")

Now let’s select samples from cluster 1,3,6,7, and perform DE analysis using only 4 samples (4Chimp + 4Human) from each cluster and compare. If relatedness plays a big (meaningful) role, I expect meaningful differences in the the number of DE genes, the fraction of false discoveries, etc. Since clusters 3, and 7 only have 4 or 5 samples, I will use all 4 samples from these clusters. For cluster6, I will exclude the least related sample (smallest average pairwase intra-cluster relatedness). For cluster1, since there are so many samples, most of which have small relatedness coefficients, I exclude the 6 samples with a relatedness coefficient > 0.05, (leaving all unrelated samples) and randomly draw (with replacement) subsamples of 4 for DE analysis, repeating this procedure many times to establish a null distribution to compare clusters 3,6 & 7.

#First, create a dataframe containing only the samples that will be considered for further analysis as described above.
#Show cluster identities:
Groups.df %>% head() %>% kable()
Ind groups
549 1
570 2
389 3
456 1
623 3
438 1
#Get the 4 samples to include for cluster3
Cluster3.Pool <- Groups.df %>% filter(groups==3) %>% pull(Ind)

#Get the 4 samples to include for cluster7
Cluster7.Pool <- Groups.df %>% filter(groups==7) %>% pull(Ind)

#Get samples to include from cluster6
#I want to include 4/5 in that cluster, excluding the sample with lowest mean pairwise kinship.
CLuster6.MeanpairwiseKinship <- GemmaMatrix %>% as.data.frame() %>%
  rownames_to_column("IndA") %>%
  gather(key="IndB", value="Kinship", -IndA) %>%
  left_join(Groups.df, by=c("IndA"="Ind")) %>%
  left_join(Groups.df, by=c("IndB"="Ind"), suffix=c(".A.group", ".B.group")) %>%
  distinct(uniqueID, .keep_all=T) %>%
  filter(groups.A.group==6 & groups.B.group==6 & !IndA==IndB) %>%
  group_by(IndA) %>%
  summarize(MeanPairwiseKinship=mean(Kinship))
kable(CLuster6.MeanpairwiseKinship)
IndA MeanPairwiseKinship
295 0.0338598
4x0430 -0.0009190
4x0519 0.0017661
529 0.0310048
554 0.0106422
Cluster6.Pool <- CLuster6.MeanpairwiseKinship %>%
  filter(MeanPairwiseKinship>min(MeanPairwiseKinship)) %>% pull(IndA)

#Get cluster1 pool. I want to include all samples in cluter1 that do not have any related individuals (kinship coefficients > 0.05)
CLuster1.MaxpairwiseKinship <- GemmaMatrix %>% as.data.frame() %>%
  rownames_to_column("IndA") %>%
  gather(key="IndB", value="Kinship", -IndA) %>%
  left_join(Groups.df, by=c("IndA"="Ind")) %>%
  left_join(Groups.df, by=c("IndB"="Ind"), suffix=c(".A.group", ".B.group")) %>%
  distinct(uniqueID, .keep_all=T) %>%
  filter(groups.A.group==1 & groups.B.group==1 & !IndA==IndB) %>%
  group_by(IndA) %>%
  summarize(MaxPairwiseKinship=max(Kinship))
kable(CLuster1.MaxpairwiseKinship)
IndA MaxPairwiseKinship
317 0.1244824
338 0.0068539
438 0.0080694
456 0.0095653
476 0.0072819
4x0025 0.0072819
4x0043 0.0095653
4X0095 0.0150108
4X0212 0.0617353
4X0267 0.0617353
4X0550 0.0150108
522 0.0180964
537 0.0052832
549 0.0638122
554_2 0.0638122
558 0.1244824
88A020 0.0026078
95A014 0.0049990
Little_R 0.0180964
Cluster1.Pool <- CLuster1.MaxpairwiseKinship %>%
  filter(MaxPairwiseKinship<0.05) %>% pull(IndA)

SamplesToDrawFrom <- cbind(Cluster1.Pool, Cluster3.Pool, Cluster6.Pool, Cluster7.Pool) %>%
  as.data.frame() %>%
  gather(key="Cluster", "Ind") %>%
  distinct(.keep_all=T)

SamplesToDrawFrom %>% kable()
Cluster Ind
Cluster1.Pool 338
Cluster1.Pool 438
Cluster1.Pool 456
Cluster1.Pool 476
Cluster1.Pool 4x0025
Cluster1.Pool 4x0043
Cluster1.Pool 4X0095
Cluster1.Pool 4X0550
Cluster1.Pool 522
Cluster1.Pool 537
Cluster1.Pool 88A020
Cluster1.Pool 95A014
Cluster1.Pool Little_R
Cluster3.Pool 389
Cluster3.Pool 623
Cluster3.Pool 503
Cluster3.Pool 4x523
Cluster6.Pool 295
Cluster6.Pool 4x0519
Cluster6.Pool 529
Cluster6.Pool 554
Cluster7.Pool 4x373
Cluster7.Pool 4X0333
Cluster7.Pool 4X0357
Cluster7.Pool 4X0339

Ok. So now I have pools of samples to draw from for clusters 1,3,6,7.

I will do one DE analysis for clusters 3,6,7, and many analyses for cluster1 to establish a null distribution of no relatedness. I think for this question it makes sense to make the repeated analysis with random from cluster1 sampling without replacement. So there are \(13 nCr 4 = 715\) possible combinations of samples to create this null distribution. That is a small enough number I will just do all of them to establish as rich of a null as possible.

But first, let’s choose the human samples to compare to for this DE analysis. I think there is no point in introducing randomness to the human samples drawn. But I think it does make sense to carefully choose them to be as high quality as possible, for example, by controlling for batch/lab. Therefore, I think I should use 4 human sampleschosen from Pavlovic et al in my DE analysis. Let’s pick these randomly.

ColColors <- ClusterGroupColors$ColorVector
ColColors[!(rownames(GemmaMatrix) %in% SamplesToDrawFrom$Ind)] <- NA
heatmap.2(GemmaMatrix, trace="none", RowSideColors = ClusterGroupColors$ColorVector, ColSideColors = ColColors)

PairwiseKinshipBoxplot <- GemmaMatrix %>% as.data.frame() %>%
  rownames_to_column("IndA") %>%
  gather(key="IndB", value="Kinship", -IndA) %>%
  left_join(Groups.df, by=c("IndA"="Ind")) %>%
  left_join(Groups.df, by=c("IndB"="Ind"), suffix=c(".A.group", ".B.group")) %>%
  filter(groups.A.group==groups.B.group & !IndA==IndB) %>%
  filter(IndA %in% SamplesToDrawFrom$Ind & IndB %in% SamplesToDrawFrom$Ind ) %>%
  mutate(uniqueID = paste0(pmax(IndA, IndB), pmin(IndA, IndB))) %>%
  distinct(uniqueID, .keep_all=T) %>%
  mutate(groups.A.group=factor(groups.A.group)) %>%
  ggplot(aes(x=groups.A.group, y=Kinship, color=groups.A.group)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(position=position_jitter(0.2)) +
  ylab("Pairwise kinship coefficients\nwithin cluster") +
  xlab("Filtered kinship cluster") +
  scale_colour_manual(name = "ClusterGroup",values = ClusterGroupColors$Key) +
  theme_bw() +
  theme(legend.position = "none", axis.text.x=element_blank())
PairwiseKinshipBoxplot

Now let’s wrap the desired DE analysis into a function for easier repitition:

###Define function
DE.analysis <- function(DGE.NormalizedRawCountTable, Log2RPKMCountTable, ChimpSamples, HumanSamples){
  AllSamplesToSubset <- c(ChimpSamples, HumanSamples)
  CountTableSampled <- DGE.NormalizedRawCountTable[,AllSamplesToSubset]
  SpeciesFactor <- colnames(CountTableSampled) %>% substr(1,1) %>% factor() %>% unclass() %>% as.character()
  mm <- model.matrix(~0 + SpeciesFactor)
  y <- voom(CountTableSampled, mm, normalize.method="cyclicloess", plot=F)
  y$E <- Log2RPKMCountTable[,AllSamplesToSubset]
  
  fit<- lmFit(y, mm)
  contr <- makeContrasts(DE=SpeciesFactor1-SpeciesFactor2, levels = mm)
  tmp <- contrasts.fit(fit, contrasts=contr)
  efit <- eBayes(tmp)
  Results <- topTable(efit, sort.by = "none", n=Inf)
  return(Results)
}

###Create some of the required arguments

CountTableChimpFile <- '../output/PowerAnalysisFullCountTable.Chimp.subread.txt.gz'
CountTableHumanFile <- '../output/PowerAnalysisFullCountTable.Human.subread.txt.gz'
OutputDE <- '../output/Final/TableS2.tab'
DropFileName <- '../data/DE_SamplesToDrop.txt'

DropFile <- read.delim(DropFileName, sep='\t', col.names = c("Sample", "Species"), stringsAsFactors = F)
HumanSamplesToDrop <- DropFile %>% filter(Species=="Human") %>% pull(Sample)
ChimpSamplesToDrop <- DropFile %>% filter(Species=="Chimp") %>% pull(Sample)
DE.results <- read.delim(OutputDE, sep='\t', stringsAsFactors = F)
GeneListForOverdispersionCalculation <- DE.results$Ensembl_geneID

CountTables <- GetCountTables(CountTableChimpFile,
                              CountTableHumanFile,
                              0, GeneListForOverdispersionCalculation, ChimpSampleDrop=ChimpSamplesToDrop, HumanSampleDrop = HumanSamplesToDrop)

DGE.NormalizedRawCountTable <- cbind(CountTables$Chimp$Counts, CountTables$Human$Counts )
  DGEList() %>%
  calcNormFactors()
An object of class "DGEList"
$counts
<0 x 0 matrix>

$samples
[1] group        lib.size     norm.factors
<0 rows> (or 0-length row.names)
Log2RPKMCountTable <- cbind(CountTables$Chimp$log2RPKM, CountTables$Human$log2RPKM)

set.seed(0)
HumanSamples <- colnames(CountTables$Human$Counts)[CountTables$Human$Counts %>% colnames() %>% str_detect("SRR", negate = T)] %>%
  sample(4)
ChimpSamples <- paste0("C.", Cluster3.Pool)


###print the required arguments to illustrate what is going on
DGE.NormalizedRawCountTable[1:10,1:10] %>% kable()
C .4x0519 C .4x373 C .476 C .4x0025 C .438 C .462 C .4X0354 C .503 C .317 C .623
ENSG00000186891 46 73 20 28 4 11 98 13 11 3
ENSG00000186827 274 393 313 397 173 128 975 236 307 84
ENSG00000078808 5967 1931 7637 3892 2957 3745 1189 5857 4306 3704
ENSG00000176022 1097 402 1574 1333 955 1235 416 1087 745 640
ENSG00000184163 7 49 113 62 62 29 111 82 24 18
ENSG00000160087 761 283 1153 776 471 610 275 920 547 604
ENSG00000162572 7 4 17 2 5 16 12 11 9 10
ENSG00000131584 2587 1537 3646 2546 1755 1323 4585 2250 1484 1872
ENSG00000169972 282 78 396 257 209 282 111 243 141 252
ENSG00000127054 3009 2009 7404 3300 3600 2092 3147 4403 2740 3116
Log2RPKMCountTable[1:10,1:10] %>% kable()
C.4x0519 C.4x373 C.476 C.4x0025 C.438 C.462 C.4X0354 C.503 C.317 C.623
ENSG00000186891 -0.4631141 1.5452701 -1.7906428 -0.7436310 -2.7486168 -2.037643 1.502654 -1.8335368 -1.867155 -3.0690602
ENSG00000186827 1.8312433 3.7552296 1.7373802 2.7719656 1.9481646 1.020653 4.595932 1.8981157 2.472714 0.8400153
ENSG00000078808 4.9992002 4.7877280 5.0681744 4.7966794 4.7681080 4.602012 3.621035 5.2552587 5.012015 5.0089176
ENSG00000176022 3.0009070 2.9687688 3.2341574 3.6942616 3.5811589 3.445338 2.551394 3.2700042 2.926360 2.9213715
ENSG00000184163 -3.1250345 0.5876174 0.1042405 -0.0550313 0.3005269 -1.218992 1.285548 0.2081517 -1.277634 -1.4597615
ENSG00000160087 2.3164984 2.3055132 2.6277490 2.7569100 2.4056076 2.272207 1.798519 2.8713177 2.323592 2.6794380
ENSG00000162572 -4.2857112 -3.8425882 -3.5385672 -5.0822034 -4.0764170 -3.142342 -2.918254 -3.5855126 -3.656867 -3.3581316
ENSG00000131584 3.2105543 3.8747917 3.4183582 3.6009371 3.4321338 2.518798 4.982773 3.2921295 2.892687 3.4412337
ENSG00000169972 2.2566182 1.8242974 2.4572966 2.5336727 2.6025630 2.528399 1.864482 2.3241553 1.747091 2.7869341
ENSG00000127054 4.0532149 4.8857942 5.0644684 4.5997661 5.0927794 3.803711 5.064977 4.8847706 4.401255 4.8006463
HumanSamples
[1] "H.63145" "H.62606" "H.59167" "H.59263"
ChimpSamples
[1] "C.389"   "C.623"   "C.503"   "C.4x523"
###Test the function
Results <- DE.analysis(DGE.NormalizedRawCountTable=DGE.NormalizedRawCountTable, ChimpSamples=ChimpSamples, Log2RPKMCountTable=Log2RPKMCountTable, HumanSamples=HumanSamples)
Results %>% head() %>% kable()
logFC AveExpr t P.Value a dj.P.Val B
ENSG00000186891 1.1813874 -1.6752344 1.130220 0.2869894 0.4529246 -5.551883
ENSG00000186827 -1.2248842 2.5200273 -2.274735 0.0483767 0.1449773 -4.895913
ENSG00000078808 0.2885241 5.0609380 1.559132 0.1526664 0.3005444 -6.245397
ENSG00000176022 0.1927662 3.0497948 1.255078 0.2403959 0.4041648 -6.572526
ENSG00000184163 -0.7863931 0.0711966 -1.606493 0.1418966 0.2868254 -5.390743
ENSG00000160087 0.3136660 2.5227387 1.724462 0.1179827 0.2542184 -5.893499

Ok, so I have a function to do the DE analysis. Let’s use it to compare all combinations of the Cluster1.Pool of chimp samples vs 4 human samples, and do a similar analysis but with the cluster3Pool, cluster6Pool, and cluster7Pool.

#First get all combinations of samples in cluster1Pool.

Cluster1.Combinations <- combn(Cluster1.Pool, 4, simplify = F)
Cluster1.Combinations.Pasted <- lapply(Cluster1.Combinations, function(iter) paste0("C.", iter))
length(Cluster1.Combinations.Pasted)
[1] 715
Cluster1.Combinations.Pasted[1:10]
[[1]]
[1] "C.338" "C.438" "C.456" "C.476"

[[2]]
[1] "C.338"    "C.438"    "C.456"    "C.4x0025"

[[3]]
[1] "C.338"    "C.438"    "C.456"    "C.4x0043"

[[4]]
[1] "C.338"    "C.438"    "C.456"    "C.4X0095"

[[5]]
[1] "C.338"    "C.438"    "C.456"    "C.4X0550"

[[6]]
[1] "C.338" "C.438" "C.456" "C.522"

[[7]]
[1] "C.338" "C.438" "C.456" "C.537"

[[8]]
[1] "C.338"    "C.438"    "C.456"    "C.88A020"

[[9]]
[1] "C.338"    "C.438"    "C.456"    "C.95A014"

[[10]]
[1] "C.338"      "C.438"      "C.456"      "C.Little_R"

Note that since this takes a long time to compute the DE results for 715 combinations, I will turn this code block to eval=F and write out the results, and read in the results in the next code block…

#iterate over combinations list, and calculate DE results for each
Cluster1.Combinations.Results <-pbapply::pblapply(Cluster1.Combinations.Pasted, function(iter) DE.analysis(HumanSamples=HumanSamples, Log2RPKMCountTable=Log2RPKMCountTable, DGE.NormalizedRawCountTable=DGE.NormalizedRawCountTable, ChimpSamples = iter))


#Write out results
saveRDS(Cluster1.Combinations.Results, file = "../big_data/Cluster1_DE_Results_AllCombinations.rds")

Let’s start by plotting the number of DE genes (at a few FDR thresholds) when each of the clusters 3, 6, and 7 were used for the chimp samples, and compare it to the distribution across all 715 combinations from cluster 1.

#Read in results
Cluster1.Combinations.Results <- readRDS("../big_data/Cluster1_DE_Results_AllCombinations.rds")

Cluster1.Combinations.Results.df <- Cluster1.Combinations.Results %>%
  lapply(add_rownames, "Ensembl_geneID") %>%
  bind_rows(.id="CombinationNumber")

#DE analysis for other cluster groups
OtherClusters.df <- SamplesToDrawFrom %>% filter(!Cluster=="Cluster1.Pool") %>%
  mutate(Ind=paste0("C.", Ind))
OtherClusters.list <- split(OtherClusters.df$Ind, OtherClusters.df$Cluster)

OtherClusters.Results <-pbapply::pblapply(OtherClusters.list, function(iter) DE.analysis(HumanSamples=HumanSamples, Log2RPKMCountTable=Log2RPKMCountTable, DGE.NormalizedRawCountTable=DGE.NormalizedRawCountTable, ChimpSamples = iter))

OtherClusters.Results.df <- OtherClusters.Results %>%
  lapply(add_rownames, "Ensembl_geneID") %>%
  bind_rows(.id="ClusterNumber") %>%
  mutate(ClusterNumber=gsub("Cluster(\\d).Pool", "\\1", ClusterNumber))
  

NumResults <- OtherClusters.Results.df %>%
  dplyr::select(adj.P.Val, ClusterNumber) %>%
  mutate(FDR.10 = adj.P.Val < 0.10,
         FDR.05 = adj.P.Val < 0.05,
         FDR.01 = adj.P.Val < 0.01,) %>%
  gather(key="FDR", value="Signif", -adj.P.Val, -ClusterNumber) %>%
  group_by(ClusterNumber, FDR) %>%
  summarise(NumSigGenes=sum(Signif))


ToPlotNumSig <- Cluster1.Combinations.Results.df %>%
  dplyr::select(adj.P.Val, CombinationNumber) %>%
  mutate(FDR.10 = adj.P.Val < 0.10,
         FDR.05 = adj.P.Val < 0.05,
         FDR.01 = adj.P.Val < 0.01,) %>%
  gather(key="FDR", value="Signif", -adj.P.Val, -CombinationNumber) %>%
  group_by(CombinationNumber, FDR) %>%
  summarise(NumSigGenes=sum(Signif)) %>%
  mutate(Cluster1.Combinations="1")

NumDE.Plot <- ggplot(ToPlotNumSig, aes(x=NumSigGenes)) +
  # stat_ecdf(color=ClusterGroupColors$Key[1]) +
  geom_density(aes(fill=Cluster1.Combinations)) +
  geom_vline(data=NumResults, size=1, aes(xintercept=NumSigGenes, color=ClusterNumber)) +
  facet_wrap(~FDR, scales="free_y") +
  ylab("empirical probability density") +
  xlab("Num DE genes") +
  scale_colour_manual(name = "Cluster",values = ClusterGroupColors$Key, labels=NULL) +
  scale_fill_manual(name = "All nCr(13,4)=715\ncombinations",values = ClusterGroupColors$Key, labels=NULL) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45))

NumDE.Plot

The bimodal distribution is a bit odd, and probably reflects two or three “bad” samples in the group of 13 samples in cluster13 which skew results to less genes when randomly drawn. I suppose odd things are expected when randomly drawing from such a small sample size. In any case, I think we are fundamentally limited towards fully emperically quantifying the relatedness effect reviewers comment. But in any case, it doesn’t seem like there is any drastic effect.

Let’s make similar plots but plotting the empirical FDR (based on FDR<0.01 genes from the full data set as the ad hoc gold standard).

ToPlotFDR <- Cluster1.Combinations.Results.df %>%
  dplyr::select(adj.P.Val, CombinationNumber, Ensembl_geneID) %>%
  mutate(FDR.10 = adj.P.Val < 0.10,
         FDR.05 = adj.P.Val < 0.05,
         FDR.01 = adj.P.Val < 0.01) %>%
  dplyr::select(FDR.10:FDR.01, CombinationNumber, Ensembl_geneID) %>%
  gather(key="FDR", value="Signif", -CombinationNumber, -Ensembl_geneID) %>%
  left_join(DE.results, by="Ensembl_geneID") %>%
  mutate(TrueResponse = adj.P.Val < 0.01) %>%
  mutate(CorrectResponse = (TrueResponse & Signif)) %>%
  group_by(CombinationNumber, FDR) %>%
  summarise(NumSigGenes=1-sum(CorrectResponse)/sum(Signif)) %>%
  mutate(Cluster1.Combinations="1")


FDR.Results <- OtherClusters.Results.df %>%
  dplyr::select(adj.P.Val, ClusterNumber, Ensembl_geneID) %>%
  mutate(FDR.10 = adj.P.Val < 0.10,
         FDR.05 = adj.P.Val < 0.05,
         FDR.01 = adj.P.Val < 0.01) %>%
  dplyr::select(FDR.10:FDR.01, ClusterNumber, Ensembl_geneID) %>%
  gather(key="FDR", value="Signif", -ClusterNumber, -Ensembl_geneID) %>%
  left_join(DE.results, by="Ensembl_geneID") %>%
  mutate(TrueResponse = adj.P.Val < 0.01) %>%
  mutate(CorrectResponse = (TrueResponse & Signif)) %>%
  group_by(ClusterNumber, FDR) %>%
  summarise(NumSigGenes=1-sum(CorrectResponse)/sum(Signif))

FDR.Plot <- ggplot(ToPlotFDR, aes(x=NumSigGenes)) +
  # stat_ecdf(color=ClusterGroupColors$Key[1]) +
  geom_density(aes(fill=Cluster1.Combinations)) +
  geom_vline(data=FDR.Results, size=1, aes(xintercept=NumSigGenes, color=ClusterNumber)) +
  facet_wrap(~FDR, scales="free_y") +
  ylab("empirical probability density") +
  xlab("Empirical FDR estimate") +
  scale_colour_manual(name = "Cluster",values = ClusterGroupColors$Key, labels=NULL) +
  scale_fill_manual(name = "All nCr(13,4)=715\ncombinations",values = ClusterGroupColors$Key, labels=NULL) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45))
FDR.Plot

Now let’s try to explain that bimodal shape in the cluster1 combinations… Let’s start by looking at a PCA or hiearchal clustering of the cluster1 samples to look for outliers, and see if the inclusion of such outliers in the combinations is causing the bimodal shape.

Cluster1.CountTable <- CountTables$Chimp$log2RPKM %>%
  dplyr::select(one_of(
    paste0("C.", SamplesToDrawFrom %>% filter(Cluster=="Cluster1.Pool") %>% pull(Ind))
  ))

PCA <- Cluster1.CountTable %>%
  t() %>% prcomp()

Metadata <- read.delim("../output/Final/TableS1.tab", stringsAsFactors = F) %>%
  dplyr::select(IID=Sample.ID, SX, SP, UniquelyMappingReads, RNA.Extraction.Batch, RIN, PercentUniquelyMappingReads) %>%
  mutate(IID=paste0("C.", IID))

PCA$x %>% as.data.frame() %>% dplyr::select(PC1:PC2) %>%
  rownames_to_column("IID") %>%
  left_join(Metadata, by="IID") %>%
  ggplot(aes(x=PC1, y=PC2, color=PercentUniquelyMappingReads)) +
  geom_text(aes(label=IID)) +
  theme_bw()

Cluster1.CountTable %>% cor() %>%
  heatmap.2()

CountTables$Chimp$log2RPKM %>%
  dplyr::select(one_of(
    paste0("C.", SamplesToDrawFrom %>% pull(Ind))
  )) %>% cor() %>%
  heatmap.2()

There are indeed two outliers. Let’s see if those are responsible for the bimodal shape…

ClusterOneCombinationsWithOutliers <- Cluster1.Combinations.Pasted %>%
  unlist() %>%
  matrix(byrow=T, nrow=length(Cluster1.Combinations.Pasted)) %>%
  as.data.frame() %>%
  mutate(CombinationNumber=1:n(),
         CombinationPasted=paste(V1,V2,V3,V4)) %>%
  mutate(ContainsOutliers=str_detect(CombinationPasted, "C.Little_R|C.537")) %>%
  filter(ContainsOutliers==T) %>%
  pull(CombinationNumber)


ToPlotNumSig %>%
  mutate(ContainsOutliers=CombinationNumber %in% ClusterOneCombinationsWithOutliers) %>%
ggplot(aes(x=NumSigGenes)) +
  # stat_ecdf(color=ClusterGroupColors$Key[1]) +
  geom_density(aes(fill=Cluster1.Combinations, linetype=ContainsOutliers)) +
  geom_vline(data=NumResults, size=1, aes(xintercept=NumSigGenes, color=ClusterNumber)) +
  facet_wrap(~FDR, scales="free_y") +
  ylab("empirical probability density") +
  xlab("Num DE genes") +
  scale_colour_manual(name = "Cluster",values = ClusterGroupColors$Key, labels=NULL) +
  scale_fill_manual(name = "All nCr(13,4)=715\ncombinations",values = ClusterGroupColors$Key, labels=NULL) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45))

Ok. I think I want to just remake some of the previous plots, but show how these outliers have a different distribution.

Start with a heatmap of RNA-seq data from all the samples included in this analysis, colored by group, and batch.

I’m going to set this code block to eval=F so that it doesn’t take as long to build this html doc.

CorMatToPlot <- CountTables$Chimp$log2RPKM %>%
  dplyr::select(one_of(
    paste0("C.", SamplesToDrawFrom %>% pull(Ind))
  )) %>% cor()

MetadataForHeatMap <- data.frame(C.Ind=rownames(CorMatToPlot)) %>%
  mutate(Ind=str_remove(C.Ind, "^C.")) %>%
  left_join(SamplesToDrawFrom, by ="Ind") %>%
  mutate(ClusterNumber=gsub("Cluster(\\d).Pool", "\\1", Cluster)) %>%
  left_join(data.frame(Color=ClusterGroupColors$Key) %>% rownames_to_column("ClusterNumber"), by="ClusterNumber") %>%
  left_join(Metadata %>% dplyr::select(RNA.Extraction.Batch, IID), by=c("C.Ind"="IID"))


rownames(CorMatToPlot) <- str_remove(rownames(CorMatToPlot), "^C.")
colnames(CorMatToPlot) <- str_remove(colnames(CorMatToPlot), "^C.")

pdf(file="../figures/OriginalArt/ResponseToReviewers.DE.ExpressionMat.pdf")
heatmap.2(CorMatToPlot, trace="none", RowSideColors = MetadataForHeatMap$RNA.Extraction.Batch %>% as.factor() %>% as.numeric() %>% as.character(), ColSideColors = MetadataForHeatMap$Color %>% as.character(), dendrogram="row")
dev.off()


pdf(file="../figures/OriginalArt/ResponseToReviewers.DE.KinshipMat.pdf")
heatmap.2(GemmaMatrix, trace="none", RowSideColors = ClusterGroupColors$ColorVector, ColSideColors = ColColors, dendrogram="row")
dev.off()

NumDE.Plot <- ToPlotNumSig %>%
  mutate(ContainsOutliers=CombinationNumber %in% ClusterOneCombinationsWithOutliers) %>%
ggplot(aes(x=NumSigGenes)) +
  # stat_ecdf(color=ClusterGroupColors$Key[1]) +
  geom_density(aes(fill=Cluster1.Combinations, linetype=ContainsOutliers), alpha=0.9) +
  geom_vline(data=NumResults, size=1, aes(xintercept=NumSigGenes, color=ClusterNumber)) +
  facet_wrap(~FDR, scales="free_y") +
  ylab("empirical density") +
  xlab("Num DE genes") +
  scale_colour_manual(name = "Filtered\nkinship clusters", values = ClusterGroupColors$Key, labels=NULL) +
  scale_fill_manual(name = "All nCr(13,4)=715\ncombinations",values = ClusterGroupColors$Key, labels=NULL) +
  scale_linetype_manual(name="Combinations that contain samples\n'Little_R' or '537'", values=c("solid", "dashed")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45), legend.position = "bottom") +
    guides(color = guide_legend(title.position = "top"),
         linetype = guide_legend(title.position = "top"),
         fill = guide_legend(title.position = "top"))
NumDE.Plot

FDR.Plot <- ToPlotFDR %>%
  mutate(ContainsOutliers=CombinationNumber %in% ClusterOneCombinationsWithOutliers) %>%
  ggplot(aes(x=NumSigGenes)) +
  # stat_ecdf(color=ClusterGroupColors$Key[1]) +
  geom_density(aes(x=NumSigGenes, ..density.., fill=Cluster1.Combinations, linetype=ContainsOutliers), alpha=0.9) +
  geom_vline(data=FDR.Results, size=1, aes(xintercept=NumSigGenes, color=ClusterNumber)) +
  facet_wrap(~FDR, scales="free_y") +
  ylab("empirical density") +
  xlab("Empirical FDR estimate") +
  scale_colour_manual(name = "Filtered\nkinship clusters",values = ClusterGroupColors$Key, labels=NULL) +
  scale_fill_manual(name = "All nCr(13,4)=715\ncombinations",values = ClusterGroupColors$Key, labels=NULL) +
  scale_linetype_manual(name="Combinations that contain samples\n'Little_R' or '537'", values=c("solid", "dashed")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45), legend.position = "bottom") +
  guides(color = guide_legend(title.position = "top"),
         linetype = guide_legend(title.position = "top"),
         fill = guide_legend(title.position = "top"))
FDR.Plot


ggsave("../figures/OriginalArt/ResponseToReviewers.DE.PairwiseKinship.pdf", plot=PairwiseKinshipBoxplot, height=3, width=2.5)
ggsave("../figures/OriginalArt/ResponseToReviewers.DE.NumDE.pdf", plot=NumDE.Plot, height=3, width=6.5)
ggsave("../figures/OriginalArt/ResponseToReviewers.DE.FDR.pdf", plot=FDR.Plot, height=3, width=6.5)

Also, let’s use the variance partition to describe how much gene expression variance (among the chimp samples only) can be explained by factors like sex, batch, RIN, and the kinship cluster groups. Also turning eval=F because variance partition takes a few minutes. But the figures generated will be published in the paper or as part of the public response to reviewers.

info <- read.delim("../output/Final/TableS1.tab", stringsAsFactors = F) %>%
  dplyr::select(Alternate.ID, IID=Sample.ID, SX, SP, UniquelyMappingReads, RNA.Extraction.Batch, RIN, PercentUniquelyMappingReads) %>%
  mutate(SP=recode(SP, Human="H", Chimp="C")) %>%
  filter(SP=="C") %>%
  inner_join(Groups.df, by=c("IID"="Ind")) %>%
  mutate(Individual=paste0("C.", IID))

gExpr <- cbind(CountTables$Chimp$Counts) %>%
  dplyr::select(one_of(info$Individual)) %>%
  DGEList() %>%
  calcNormFactors
vobjGenes <- voom(gExpr)

info <- info %>%
  filter(Individual %in% colnames(gExpr)) %>%
  column_to_rownames("Individual")


form <- ~  RIN + SX + RNA.Extraction.Batch + groups

varPart <- fitExtractVarPartModel( vobjGenes, form, info )

# sort variables (i.e. columns) by median fraction # of variance explained
vp <- sortCols( varPart )

# violin plot of contribution of each variable to total variance
plotVarPart( vp )



info <- read_excel("../data/Metadata_SequencedChimps.xlsx") %>%
  dplyr::select(IID=1, SX, RNA.Extraction.Batch=RNA.Extract_date, RNA.Library.prep.batch, RIN) %>%
  mutate(RNA.Library.prep.batch=as.factor(RNA.Library.prep.batch)) %>%
  # inner_join(Groups.df, by=c("IID"="Ind")) %>%
  inner_join(SamplesToDrawFrom, by=c("IID"="Ind")) %>%
  dplyr::rename(groups=Cluster) %>%
  mutate(groups=as.factor(groups)) %>%
  mutate(Individual=paste0("C.", IID))
gExpr <- cbind(CountTables$Chimp$Counts) %>%
  dplyr::select(one_of(info$Individual)) %>%
  DGEList() %>%
  calcNormFactors
vobjGenes <- voom(gExpr)

info <- info %>%
  filter(Individual %in% colnames(gExpr)) %>%
  column_to_rownames("Individual")


# form <- ~  RIN + (1|SX) + (1|RNA.Extraction.Batch)  + (1|groups)
# varPart <- fitExtractVarPartModel( vobjGenes, form, info )

form <- ~  (1|SX) + (1|RNA.Extraction.Batch)  + (1|groups)
varPart <- fitExtractVarPartModel( vobjGenes, form, info )


# sort variables (i.e. columns) by median fraction # of variance explained
vp <- sortCols( varPart )

# violin plot of contribution of each variable to total variance
VarPartPlot <- plotVarPart( vp ) +
  scale_x_discrete(labels=c("SX" = "Sex", "groups"="Filtered\nkinship clusters"))

f <- function(x) {
  r <- quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

VarPartPlot <- vp %>%
  gather(key="factor", value="Variance explained (%)") %>%
  ggplot(aes(x=fct_reorder(factor, `Variance explained (%)`), y=`Variance explained (%)`)) +
  stat_summary(fun.data=f, geom="boxplot") + 
  scale_x_discrete(labels=c("SX" = "Sex", "groups"="Filtered\nkinship clusters")) +
  xlab("") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


ggsave("../figures/OriginalArt/ResponseToReviewers.DE.KinshipVarPart.pdf", VarPartPlot, height=3.5, width=2.5)

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] cowplot_1.0.0            gridExtra_2.3            MASS_7.3-51.4           
 [4] knitr_1.26               pbapply_1.4-3            edgeR_3.26.8            
 [7] variancePartition_1.14.1 Biobase_2.44.0           BiocGenerics_0.30.0     
[10] foreach_1.4.7            limma_3.40.6             scales_1.1.0            
[13] readxl_1.3.1             gplots_3.0.1.1           forcats_0.4.0           
[16] stringr_1.4.0            dplyr_0.8.3              purrr_0.3.3             
[19] readr_1.3.1              tidyr_1.0.0              tibble_2.1.3            
[22] ggplot2_3.2.1            tidyverse_1.3.0         

loaded via a namespace (and not attached):
 [1] nlme_3.1-143        bitops_1.0-6        fs_1.3.1           
 [4] pbkrtest_0.4-8.6    lubridate_1.7.4     doParallel_1.0.15  
 [7] progress_1.2.2      httr_1.4.1          rprojroot_1.3-2    
[10] tools_3.6.1         backports_1.1.5     R6_2.4.1           
[13] KernSmooth_2.23-16  DBI_1.0.0           lazyeval_0.2.2     
[16] colorspace_1.4-1    withr_2.1.2         prettyunits_1.0.2  
[19] tidyselect_0.2.5    compiler_3.6.1      git2r_0.26.1       
[22] cli_2.0.0           rvest_0.3.5         xml2_1.2.2         
[25] labeling_0.3        caTools_1.17.1.3    digest_0.6.23      
[28] minqa_1.2.4         rmarkdown_1.18      colorRamps_2.3     
[31] pkgconfig_2.0.3     htmltools_0.4.0     lme4_1.1-23        
[34] highr_0.8           dbplyr_1.4.2        rlang_0.4.1        
[37] rstudioapi_0.10     farver_2.0.1        generics_0.0.2     
[40] jsonlite_1.6        BiocParallel_1.18.1 gtools_3.8.1       
[43] magrittr_1.5        Matrix_1.2-18       Rcpp_1.0.5         
[46] munsell_0.5.0       fansi_0.4.0         lifecycle_0.1.0    
[49] stringi_1.4.3       whisker_0.4         yaml_2.2.0         
[52] plyr_1.8.5          grid_3.6.1          gdata_2.18.0       
[55] promises_1.1.0      crayon_1.3.4        lattice_0.20-38    
[58] haven_2.2.0         splines_3.6.1       hms_0.5.2          
[61] locfit_1.5-9.1      zeallot_0.1.0       pillar_1.4.2       
[64] boot_1.3-23         reshape2_1.4.3      codetools_0.2-16   
[67] reprex_0.3.0        glue_1.3.1          evaluate_0.14      
[70] modelr_0.1.5        vctrs_0.2.0         nloptr_1.2.2.2     
[73] httpuv_1.5.2        cellranger_1.1.0    gtable_0.3.0       
[76] assertthat_0.2.1    xfun_0.11           broom_0.5.2        
[79] later_1.0.0         iterators_1.0.12    workflowr_1.5.0    
[82] statmod_1.4.34      ellipsis_0.3.0