Comparing Different EBV Prediction Methods


  A bash script is outlined below that illustrates how to loop through different simulation scenarios. The following script may take a couple of hours to finish depending on the computing environment. The difference between the scenarios is how estimated breeding values (EBV) are estimated across individuals. The four methods to predict EBV are: a pedigree-based best linear unbiased prediction (BLUP) animal model, a genomic-based BLUP animal model, a run-of-homozygosity-based BLUP animal model and a Bayesian spike and slab marker effects model (BayesC). All three scenarios utilize the same founder sequence, which is generated by first using the 'sequence' option for the "START" parameter and then the remaining scenarios use the 'founder' option for the "START" parameter. In order to generate multiple replicates within each scenario, the "NREP" parameter is utilized and 15 replicates were generated within each scenario. The replicates are saved within the "GenoDiverFiles" directory for each scenario. If the directory isn't renamed, the next scenario will replace the files. Therefore, after a scenario is done the directory is renamed within the bash script.

##--------------------------------------------------------------------
## First generate parameter file. Put parameters to change at the very
## bottom of the file. That way you can use the 'head' linux command
## for the parameters that change and use the 'echo' linux command to
## add any new parameters.
##--------------------------------------------------------------------
## In case 4_Example.txt is still a file ##
rm -rf Example12.txt || TRUE

# Parameters that are the same #
echo '-| General |-' >> Example12.txt
echo 'SEED: 1500' >> Example12.txt
echo 'NREP: 15' >> Example12.txt
echo '−| Genome & Marker |−' >> Example12.txt
echo 'CHR: 3' >> Example12.txt
echo 'CHR_LENGTH: 150 150 150' >> Example12.txt
echo 'NUM_MARK: 4000 4000 4000' >> Example12.txt
echo 'QTL: 200 200 200' >> Example12.txt
echo '−| Population |−' >> Example12.txt
echo 'FOUNDER_Effective_Size: Ne100_Scen1' >> Example12.txt
echo 'MALE_FEMALE_FOUNDER: 50 400 random 5' >> Example12.txt
echo 'VARIANCE_A: 0.35' >> Example12.txt
echo '−| Selection |−' >> Example12.txt
echo 'GENERATIONS: 10' >> Example12.txt
echo 'INDIVIDUALS: 50 0.2 400 0.2' >> Example12.txt
echo 'PROGENY: 1' >> Example12.txt
echo 'SELECTION: ebv high' >> Example12.txt
echo 'CULLING: ebv 8' >> Example12.txt
echo 'MATING: random125 simu_anneal' >> Example12.txt

##--------------------------------------------------------------------
## First do PBLUP based EBV prediction
##--------------------------------------------------------------------
echo 'START: sequence' >> Example12.txt
echo 'EBV_METHOD: pblup' >> Example12.txt
## Run GenoDiver ##
./GenoDiver Example12.txt
## rename replicate output to rep_pblup ##
mv ./GenoDiverFiles/replicates ./GenoDiverFiles/reps_pblup

##--------------------------------------------------------------------
## Then do GBLUP based EBV prediction
##--------------------------------------------------------------------
## Remove parameters that are changing ##
head -n 19 Example12.txt > Example12a.txt
mv ./Example12a.txt ./Example12.txt
## echo in new paramters ##
echo 'START: founder' >> Example12.txt
echo 'EBV_METHOD: gblup' >> Example12.txt
## Run GenoDiver ##
./GenoDiver Example12.txt
## rename replicate output to rep_gblup ##
mv ./GenoDiverFiles/replicates ./GenoDiverFiles/reps_gblup

##--------------------------------------------------------------------
## Then do ROHBLUP based EBV prediction
##--------------------------------------------------------------------
## Remove parameters that are changing ##
head -n 19 Example12.txt > Example12a.txt
mv ./Example12a.txt ./Example12.txt
## echo in new paramters ##
echo 'START: founder' >> Example12.txt
echo 'EBV_METHOD: rohblup' >> Example12.txt
## Run GenoDiver ##
./GenoDiver Example12.txt
## rename replicate output to rep_rohblup ##
mv ./GenoDiverFiles/replicates ./GenoDiverFiles/reps_rohblup

##--------------------------------------------------------------------
## Then do BAYESC based EBV prediction
##--------------------------------------------------------------------
## Remove parameters that are changing ##
head -n 19 Example12.txt > Example12a.txt
mv ./Example12a.txt ./Example12.txt
## echo in new parameters
echo 'START: founder' >> Example12.txt
echo 'EBV_METHOD: bayes' >> Example12.txt
echo 'BAYESOPTIONS: BayesC 10000 2000 6000 1000 5 fix 0.10' >> Example12.txt
## Run GenoDiver ##
./GenoDiver Example12.txt
## rename replicate output to rep_bayesC ##
mv ./GenoDiverFiles/replicates ./GenoDiverFiles/reps_bayesC

Parameter File Summary
  Sequence information is generated for three chromosomes with a length of 150 Megabases. The genome simulated has a high degree of short-range LD (Ne100_Scen1). The SNP panel contains 12,000 markers (i.e. 4,000 markers per chromosome). For each chromosome, 200 randomly placed QTL and zero FTL mutations were generated. The quantitative trait simulated has a narrow sense heritability of 0.35 and only additive effects are generated (i.e. no dominance). The phenotypic variance is by default set at 1.0, and therefore the residual variance is 0.65. The founder population consisted of 50 males and 400 females. For each generation, a total of 50 males and 400 females are in the population. A total of 10 and 80 (0.2 replacement rate) male and female parents, respectively, are culled and replaced by new progeny each generation. Random selecton of progeny and culling of parents was conducted for 5 generations. After 5 generations, animals with a high EBV were selected or culled each generation. The EBV are estimated using a pedigree-based BLUP, genomic-based BLUP, ROH-based BLUP or Bayesian spike and slab marker effects model (BayesC). Each mating pair produced one progeny. Parents that had pedigree-based relationships greater than 0.125 were avoided, and this was optimized based on the simulated annealing method.

  Outlined below is a more detailed explanation of the major differences in how EBVS are calculated across the different scenarios:
  • pblup: The EBV are estimated based on a relationship matrix derived from pedigree information. All animals are utilized when generating the relationship matrix.
  • gblup: The EBV are estimated based on a relationship matrix derived from genomic information (G). When generating the relationship matrix all animals are utilized along with all of the markers. Therefore, the assumption is that all markers have an impact on the phenotype. Furthermore, allele frequencies utilized to construct G are estimated from the founder population.
  • rohblup: The EBV are estimated based on a relationship matrix derived from runs-of-homozygosity (ROH). When generating the relationship matrix all animals are utilized along with all of the markers. Therefore, the assumption is that all markers have an impact on the phenotype.
  • bayes: The EBV are estimated based on bayesian marker effects model, which estimates the effect of each marker utilizing an MCMC approach. An animals EBV is generated by the summation of an animals genotype times the estimated marker effect across all markers. When the 'bayes' option is specified, the "BAYESOPTIONS" parameter must be included in the parameter file. The model used is referred to as BayesC, which is a spike and slab marker effects model. The 5 values after specifying which type of bayesian model to implement refers to how many iterations to run the MCMC chain and the thinning rate. The first time the BayesC model is implemented (i.e. Generation 6), a total of 10,000 MCMC iterations are generated with the first 2,000 discarded as burn-in. The following generations then use the marker effects from the previous generation as starting values and a total of 6,000 MCMC iterations are generated with the first 1,000 discarded as burn-in. Lastly, a thinning rate of 5 is utilized. In BayesC it assumed only a proportion of the markers have a non-null effect, while the remaining don't have an effect. In this case we fixed the proportion and we assumed 10 % of the markers have an effect. One of the downfalls with MCMC sampling is that it does increase the computation time. Therefore you will notice that when using the bayes options the computation time will dramatically increase.
  The important files are saved within the renamed replicate folder within each scenario. The replicate number is appended to the file name. The bash script above renames the replicate directories to "reps_pblup", "reps_gblup", "reps_rohblup" and "reps_bayesC" for the pblup, gblup, rohblup and bayes EBV estimation scenarios, respectively. A screenshot of the files within the "GenoDiverFiles" is below. The directories are in blue and the files are in white.


  Utilizing the R code outlined below the following plot was generated to illustrate how to loop through each scenario and generate plots that describe the impact of EBV estimation method on the genetic trend.

R-Code
rm(list=ls()); gc()
library(ggplot2); library(tidyverse)
## Change
wd <- "/Users/jeremyhoward/Documents/39_GenoDiver_C++Code/WebsiteExamples/Example12/"
scen <- c("reps_pblup","reps_gblup","reps_rohblup","reps_bayesC") ## the directory name for each scenario ##
reps <- c(1500:1514) ## Number of replicates simulated ##
##################################################################
## Loop through and grab metric across scenarios and replicates ##
##################################################################
for(i in 1:length(scen))
{
for(j in 1:length(reps))
{
    filename <- paste(wd,scen[i],"/Summary_Statistics_DataFrame_Performance_",reps[j],sep="")
    df <- read_table2(file=filename,col_names = TRUE,col_type = "dcccccc") %>%
        mutate(tgv = as.numeric(matrix(unlist(strsplit(tgv, "[()]")), ncol = 2, byrow = TRUE)[, 1]),
                     Method = unlist(strsplit(scen[i],"_"))[2], Rep = reps[j]) %>%
        select(Generation,Method,Rep,tgv)
    if(j == 1 & i == 1){summarytable <- df}
    if(j > 1 | i > 1){summarytable <- rbind(summarytable,df);}
}
}
## generate mean and sd by generation and method
means <- aggregate(tgv ~ Generation + Method, data=summarytable,FUN=mean)
sds <- aggregate(tgv ~ Generation + Method, data=summarytable,FUN=sd)
#################################################
## Plot Genetic Trend across Different Methods ##
#################################################
plotdf <- cbind(means,sds[,3]); rm(means,sds)
names(plotdf) <- c("Generation","Method","Mean","SD")
pd <- position_dodge(0.20)

ggplot(plotdf, aes(x=as.factor(Generation), y=Mean, group=Method, colour=Method)) +
geom_errorbar(aes(ymin=Mean-SD, ymax=Mean+SD), colour="black", width=.4, size = 0.5, position=pd) +
geom_point(size=2.0) + geom_line(size=0.50) + theme_bw() +
labs(title = "Genetic Trend \n(+/- 1 SD)", x = "Generation", y = "Mean True Breeding Value") +
theme(plot.title = element_text(size = 16,hjust = 0.5),axis.title = element_text(size = 12),
legend.position="bottom",axis.text=element_text(size=10))