Package ProbBreed

Calculating the risk of cultivar recommendation in multi-environment trials

1 Introduction

Welcome to ProbBreed! This guide will help you using the package’s function, from fitting a Bayesian model to computing probabilities. Feel free to contact us if there are any issues via GitHub or e-mail. Further details are found within each function documentation (e.g., ?bayes_met) and in the original paper by Dias et al. (2022).

2 Issues with rstan

Long story short: install rstan using the tutorial below before installing ProbBreed!

ProbBreed internally uses rstan (Stan Development Team, 2023a), the R interface to Stan (Stan Development Team, 2023b), to fit Bayesian models. We noticed that the version of rstan that is currently on CRAN may not work properly. Several on-line tutorials offer solutions to overcome this problem. A solution that worked for us and other colleagues is in this link. Below, we will summarize it in four steps.

2.1 Step one: Update R

We suggest keeping R up-to-date. We will describe a solution for overcoming the rstan issue in the newer versions of R (4.3.#). The link above has solutions for older versions.

2.2 Step two: Install Rtools

If you already have an older Rtools version installed on your machine, you may need to uninstall it and manually delete the folder where its data is stored. Then, identify the installed R version and download the proper Rtools using this link. The Rtools for R 4.3 is available at this link.

2.3 Step three: Install rstan, StanHeaders and BH

Use the code below to install rstan 2.26, and the BH package:

install.packages("StanHeaders", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
install.packages("rstan", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
install.packages('BH')

2.4 Step four: Install ProbBreed

Once rstan is properly installed, ProbBreed can be installed and used without further concerns:

install.packages("ProbBreed")

or, for the development version:

# install.packages("devtools")
devtools::install_github("saulo-chaves/ProbBreed")

2.5 Issues in macOs and Linux

For specific issues when using ProbBreed in macOs and Linux, refer to the following links:

  • For macOs

Link 1

Link 2

Link 3

  • For Linux

Link 1

Link 2

If these links do not help you, feel free to contact us or open an issue in rstan forum.

3 Vignette

3.1 Step one - load ProbBreed

library(ProbBreed)

3.2 Step two - bayes_met

The first step is to fit a multi-environment Bayesian model using the bayes_met function. This function has a set of predefined models that are used to run Bayesian analyses using rstan. There are some important details that needs to be stressed:

1- Currently, the function has twelve options of models (Figure 1).

Figure 1: Options to declare replications and/or blocks (repl), years (year) and regions (reg) effects in the bayes_met function. Users must substitute Repl, Block, Year, and Region with the name of the column that contains the information about replicates, block nested in replicates (if applicable), year (if available) and region (if available). RCDB and IBD are acronyms for randomized complete block design and incomplete block design, respectively.

These models differ according to the considered information: only from locations, locations and years, environments (location-year combinations) and breeding regions, and locations, years and breeding regions. Should you consider an “environment” as a combination of environmental factors (for instance, location and planting dates within location), this information must be in the loc argument. All models will have gen and loc. If you want to consider the effect of regions (or mega-environment) and/or time factors (for instance, years or harvests), then reg and year must represent the column that contains the information on these effects. Otherwise, these arguments are set to NULL. You may control the experimental design effects in the repl argument. If repl = NULL, bayes_met will assume that you are entering with adjusted means of each environment, i.e., data without replicates. If you have data from trials laid out in randomized complete blocks, repl = 'Replicate'. Finally, if the multi-environment trials were laid out in incomplete blocks design, repl = c('Replicate', 'Block').

2- You may change the number of iterations, chains, and cores. This will vary according to the data and the processing capacity of your machine. The default is set 2000 iterations (including warm-up), 2 cores and 4 chains. Be aware that the more iterations and chains, the longer the computation time. At the same time, results are likely more reliable, and mixing/convergence issues will diminish. By default, if more than one core is set, the function runs one chain per core. The number of cores depends on the processing capacity of your machine. Choose wisely.

3- Sometimes, bayes_met may yield warnings on possible mixing/convergence issues. More details on these problems and how to deal with them are available here. Usually, increasing the number of iterations is enough. If this does not work, bayes_met has several arguments passed to rstan::sampling() in which advanced users can configure the default sampling parameters (more details on bayes_met documentation: help("bayes_met")), and hopefully, diminish these issues. Nevertheless, before taking any measures - or even disposing of the model - we recommend doing the posterior predictive check using the extr_outs function (see Step three). Checking the \(\hat{R}\) statistic, Bayesian p-values and Empirical vs Sampled density plot is particularly useful. If bayes_met showed warnings but the posterior predictive checks do not indicate serious issues, you can rely on the results and proceed with the analysis.

4- You can choose between a model with homogeneous or heterogeneous residual variances using res.het = FALSE and res.het = TRUE, respectively. When res.het = TRUE, there will be a variance for each location.

5- The Bayesian models implemented in ProbBreed have predefined priors described by Dias et al. (2022). In summary: \[ x \sim N(0,S^{[x]}) \] \[ \sigma \sim HalfCauchy(0, S^{[\sigma]}) \]

where \(x\) can be any effect but the residual, and \(\sigma\) is the standard deviation of the likelihood. If an heterogeneous model is set (res.het = TRUE), then \(\sigma_k \sim HalfCauchy(0, S^{[\sigma_k]})\).

The hyperpriors are set as follows: \[ S^{[x]} \sim HalfCauchy(0, \phi) \] where \(\phi\) is the known global hyperparameter defined such as \(\phi = max(y) \times 10\). By doing this, we restrain the hyperparameter to be finite and allow the model to take full advantage of the data to infer the posterior distribution since it provide weakly informative prior distributions, avoiding subjectivity that can possibly hamper the results. The half-Cauchy distribution was used to guarantee that variance components will not be negative.

Let us see how the twelve models can be fitted using bayes_met(). The following models are only for educational purposes, and will not provide a valid output. We will use the terms “location”, “region” and “years” to avoid the possible confusion caused by the term “environment”, which can have several meanings.

3.2.1 Locations: only means

\[ y_{jk} = \mu + l_k + g_j + gl_{jk} + \varepsilon_{jk} \]

where the \(y_{jk}\) is the phenotypic record of the \(j^{\text{th}}\) genotype in the \(k^{\text{th}}\) location, \(\mu\) is the intercept, \(l_k\) is the main effects of the \(k^{\text{th}}\) location, \(g_j\) is the main effect of the \(j^{\text{th}}\) genotype, \(gl_{jk}\) is the interaction between these two effects, and \(\varepsilon_{jk}\) is the residual effect.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = NULL,
          year = NULL,
          reg = NULL,
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

3.2.2 Locations: randomized complete block design

\[ y_{jkp} = \mu + l_k + b_{p(k)} + g_j + gl_{jk} + \varepsilon_{jkp} \]

where the \(y_{jkp}\) is the phenotypic record of the \(j^{\text{th}}\) genotype, allocated in the \(p^{\text{th}}\) block, in the \(k^{\text{th}}\) location. All other effects were previously defined but \(b_{p(k)}\), which is the effect of the \(p^{\text{th}}\) block in the \(k^{\text{th}}\) location.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = "Block",
          year = NULL,
          reg = NULL,
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

3.2.3 Locations: incomplete block design

A lattice, for example:

\[ y_{jkqp} = \mu + l_k + r_{q(k)} + b_{p(qk)} + g_j + gl_{jk} + \varepsilon_{jkqp} \]

where the \(y_{jkqp}\) is the phenotypic record of the \(j^{\text{th}}\) genotype, allocated in the \(p^{\text{th}}\) block of the \(q^{\text{th}}\) replicate, in the \(k^{\text{th}}\) location. All other effects were previously defined but \(r_{q(k)}\), which is the effect of the \(q^{\text{th}}\) replicate in the \(k^{\text{th}}\) location. Note that the indices of \(b\) also change.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = c("Replicate", "Block"),
          reg = NULL,
          year = NULL,
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

3.2.4 Regions and locations: only means

\[ y_{jkw} = \mu + m_w + l_k + g_j + gl_{jk} + gm_{jw} + \varepsilon_{jkw} \]

All effects were previously defined but \(m_w\) and \(gm_{jw}\), which are the main effects of region and genotypes-by-regions interaction, respectively.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = NULL,
          year = NULL,
          res.het = FALSE,
          reg = "Region",
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

3.2.5 Regions and locations: randomized complete block design

\[ y_{jkwp} = \mu + m_w + l_k + b_{p(k)} + g_j + gl_{jk} + gm_{jw} + \varepsilon_{jkwp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = "Block",
          reg = "Region",
          year = NULL,
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

3.2.6 Regions and Environments: incomplete block design

\[ y_{jkwqp} = \mu + m_w + l_k + r_{q(k)} + b_{p(qk)} + g_j + gl_{jk} + gm_{jw} + \varepsilon_{jkwqp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = c("Replicate", "Block"),
          reg = "Region",
          year = NULL,
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4) 

3.2.7 Years and locations: only means

\[ y_{jkh} = \mu + t_h + l_k + g_j + gl_{jk} + gt_{jh} + \varepsilon_{jkh} \] where \(t_h\) and \(t_{jh}\) are the main effect of years and the genotypes-by-years interaction effect, respectively.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = NULL,
          year = 'Year',
          res.het = FALSE,
          reg = NULL,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

3.2.8 Years and locations: randomized complete block design

\[ y_{jkhp} = \mu + t_h + l_k + b_{p(k)} + g_j + gl_{jk} + gt_{jh} + \varepsilon_{jkhp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = "Block",
          reg = NULL,
          year = 'Year',
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

3.2.9 Years and locations: incomplete block design

\[ y_{jkhqp} = \mu + t_h + l_k + r_{q(k)} + b_{p(qk)} + g_j + gl_{jk} + gt_{jh} + \varepsilon_{jkhqp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = c("Replicate", "Block"),
          reg = NULL,
          year = 'Year',
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4) 

3.2.10 Years, regions and locations: only means

\[ y_{jkhw} = \mu + t_h + m_w + l_k + g_j + gl_{jk} + gt_{jh} + gm_{jw} + \varepsilon_{jkhw} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = NULL,
          year = 'Year',
          reg = 'Region',
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

3.2.11 Years, regions and locations: randomized complete block design

\[ y_{jkhwp} = \mu + t_h + m_w + l_k + b_{p(k)} + g_j + gl_{jk} + gt_{jh} + gm_{jw} + \varepsilon_{jkhwp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = "Block",
          reg = "Region",
          year = 'Year',
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

3.2.12 Years, regions and locations: incomplete block design

\[ y_{jkhwqp} = \mu + t_h + m_w + l_k + r_{q(k)} + b_{p(qk)} + g_j + gl_{jk} + gt_{jh} + gm_{jw} + \varepsilon_{jkhwqp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = c("Replicate", "Block"),
          reg = "Region",
          year = 'Year',
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4) 

3.3 Example dataset

For the next steps, we will use the soy data, which is contained within the package. This dataset belongs to the USDA Northern Region Uniform Soybean Tests, and it is a subset of the data from Krause et al. (2023). It contains the empirical best linear unbiased estimates of genotypic means of the seed yield from 39 experimental genotypes evaluated in 14 locations across three regions or mega-environments. The full dataset can be found in the SoyURT package. Note that we do not have experimental design in this case, so the correct model has repl = NULL. In addition, the dataset belongs to a single year, so year = NULL.

mod = bayes_met(data = soy,
                gen = "Gen",
                loc = "Loc",
                repl = NULL,
                reg = "Reg",
                year = NULL,
                res.het = FALSE,
                trait = "Y",
                iter = 10000, 
                cores = 4, 
                chains = 4)

ProbBreed uses rstan to fit Bayesian models, so packages and functions designed to manage and explore these models can be used in the output object of the bayes_met function. Some interesting options are rstantools, shinystan, and bayesplot.

3.4 Step three - extr_outs

After fitting the model, the next step is to extract outputs. The extr_outs() function is a requisite for estimating the probability of superior performance using other functions. extr_outs() also provides important diagnostics regarding the convergence and reliability of the model.

Using the mod object from the previous step:

outs = extr_outs(data = soy, 
                 trait = 'Y', 
                 model = mod,
                 probs = c(0.05, 0.95), 
                 check.stan.diag = FALSE, 
                 verbose = TRUE)

where:

  • model is the model fitted using bayes_met

  • probs are the probabilities considered to calculate the quantiles and the highest posterior density (HPD)

  • check.stan.diag asks whether you want (TRUE, the default) diagnostic plots provided by rstan::stan_diag() or not (FALSE).

This function provides a list with the posterior distribution of each effect, the data generated by the model, the variances of each effect, the maximum posterior values of each effect, and quality parameters of the model (See below):

outs$variances
    effect      var        sd naive.se HPD_0.05 HPD_0.95
1 genotype    3.284     1.417    0.006    1.331    5.864
2 location  255.655   136.307    0.556  116.515  503.423
3  gen.loc    6.989     5.019    0.020    0.309   15.430
4   region 4150.795 64757.982  264.373    0.883 9695.849
5  gen.reg    1.125     1.087    0.004    0.012    3.261
6    error   11.086     5.016    0.020    2.974   18.346
outs$ppcheck
                  Diagnostics
p.val_max              0.9711
p.val_min              0.2730
p.val_median           0.6677
p.val_mean             0.5066
p.val_sd               0.5158
Eff_No_parameters    125.8434
WAIC2               2554.2142
mean_Rhat              1.0172
Eff_sample_size        0.1165

It also provides a list containing built-in histograms, density plots, and trace plots of each effect, built using the ggplot2 package (Wickham, 2016). For example, the density plot comparing the generated data by the Bayesian model and the real data.

outs$plots$densities$sampled.y

Another example is the histogram of the genotypic main effects:

outs$plots$histogram$genotype

These plots are available for all effects declared in extr_outs(). Feel free to explore it. As an option, the function also provides diagnostic plots from rstan::stan_diag(). In fact, extr_outs lets you use further options provided by the function. For more information please refer to ?rstan::stan_diag.

outs = extr_outs(data = soy, 
                 trait = 'Y', 
                 model = mod,
                 res.het = FALSE, 
                 check.stan.diag = TRUE, 
                 verbose = TRUE)

3.5 Step four - prob_sup

With the outputs extracted by extr_outs(), we can calculate the probabilities of superior performance and superior stability of the evaluated genotypes with prob_sup():

results = prob_sup(data = soy, 
                   trait = "Y", 
                   gen = "Gen", 
                   loc = "Loc",
                   mod.output = outs, 
                   reg = 'Reg', 
                   year = NULL,
                   int = .2,
                   increase = TRUE, 
                   save.df = FALSE, 
                   interactive = FALSE, 
                   verbose = FALSE)

The arguments data, trait, gen, loc, year and reg were previously described. The new arguments of this function are:

  • increase: Increasing (TRUE, default) or decreasing (FALSE) the trait mean

  • mod.output: An object that contains the outputs extracted by extr_outs().

  • int: The selection intensity, expressed in decimal values.

  • save.df: TRUE if you want to save the data frames containing the calculated probabilities in the work directory, FALSE otherwise.

  • interactive: TRUE if you want to convert the ggplots into ggplotlys (Sievert, 2020), i.e., make them interactive plots, FALSE otherwise.

The outputs of this function are divided into two lists: marginal and conditional. The first list contains the probabilities of superior performance and superior stability across environments (marginal), while the second contains the probabilities of superior performance within environments (conditional). Both lists have the pairwise probabilities of both superior performance and stability.

3.5.1 Marginal probabilities

The marginal list contains two other lists: one with data frames detailing the results of each computed probability, and another with ggplots illustrating the computed probabilities. Below, we detail each result:

3.5.1.1 Highest posterior density of genotypic main effects

This metric is illustrated in a caterpillar plot (Figure 2):

results$marginal$plots$g_hpd

Figure 2: Caterpillar plot

3.5.1.2 Probability of superior performance

Let \(\Omega\) be a subset of the high-performance selected genotypes according to the intensity of selection. A given genotype \(j\) will belong to \(\Omega\) if its genotypic marginal value (\(\hat{g}_j\)) is high (or low) enough compared to its peers. We can emulate the occurrence of \(S\) trials (\(s = 1, 2, \dots, S\)) by leveraging discretized samples of Monte Carlo posterior distributions of the fitted Bayesian models. Then, the probability of the \(j^{\text{th}}\) genotype belonging to \(\Omega\) is its ratio of success (\(\hat{g}_j \in \Omega\)) events over the total number of sampled events [\(S = (\hat{g}_j \in \Omega) + (\hat{g}_j \notin \Omega)]\), as follows:

\[ Pr(\hat{g}_j \in \Omega \vert y) = \frac{1}{S} \sum_{s=1}^S I(\hat{g}_j^{(s)} \in \Omega \vert y) \] where \(I(\hat{g}_j^{(s)} \in \Omega \vert y)\) is an indicator variable that can assume two values: (1) if \(\hat{g}_j \in \Omega\) in the \(s^{\text{th}}\) sample, and (0) otherwise. The results provided by prob_sup() can be accessed using the following commands:

head(results$marginal$df$perfo)
    ID    prob
36 G36 0.94545
31 G31 0.82250
20 G20 0.79905
9  G09 0.66595
38 G38 0.59905
1  G01 0.54605
results$marginal$plots$perfo

Figure 3: Marginal probability of superior performance

3.5.1.3 Pairwise probability of superior performance

To directly compare a selection candidate with a commercial check, or another promising genotype, we can calculate the pairwise probability of superior performance. This metric computes the probability of a genotype \(j\) performing better than a genotype \(j^\prime\):

\[ Pr(\hat{g}_{j} > \hat{g}_{j^\prime} \vert y) = \frac{1}{S} \sum_{s=1}^S I(\hat{g}_{j}^{(s)} > \hat{g}_{j^\prime}^{(s)} \vert y) \] Note that this equation applies if the direction of the selection is to increase the trait value. If the aim is to decrease it (for example, plant height or the susceptibility of a disease), you can set increase = F, and the following equation is employed:

\[ Pr(\hat{g}_{j} < \hat{g}_{j^\prime} \vert y) = \frac{1}{S} \sum_{s=1}^S I(\hat{g}_{j}^{(s)} < \hat{g}_{j^\prime}^{(s)} \vert y) \] The results are accessed using the following commands:

head(results$marginal$df$pair_perfo)
    x    prob   y
2 G01 0.97360 G02
3 G01 0.77295 G03
4 G01 0.95590 G04
5 G01 0.99120 G05
6 G01 0.74290 G06
7 G01 0.74170 G07
results$marginal$plots$pair_perfo

Figure 4: Marginal pairwise probability of superior performance

The results indicates the probability of genotypes at the x-axis performing superiorly in relation to those at the y-axis.

3.5.1.4 Probability of superior stability

Considering genotypes that have low genotype-by-environment interaction variance [\(var(ge_{jk})\)] as stable, the probability of superior stability is given by the following equation:

\[ Pr[var(\widehat{ge}_{jk}) \in \Omega \vert y] = \frac{1}{S} \sum_{s=1}^S I[var(\widehat{ge}_{jk}^{(s)}) \in \Omega \vert y] \] where \(I[var(\widehat{ge}_{jk}^{(s)}) \in \Omega \vert y]\) indicates if \(var(\widehat{ge}_{jk}^{(s)})\) exists in \(\Omega\) (1) or not (0) for the \(j^{th}\) genotype on the \(k^{th}\) environment. Note that this probability can only be computed across environments since it depends on \(var(\widehat{ge}_{jk})\). Pairwise probabilities of superior stability are also possible in this context:

  • Genotype-by-location interaction
head(results$marginal$df$stabi_gl)
    ID    prob
23 G23 0.47675
34 G34 0.43345
35 G35 0.38025
8  G08 0.30390
27 G27 0.30015
13 G13 0.29990
results$marginal$plots$stabi_gl

Figure 5: Marginal probability of superior stability across environments (locations)
  • Genotype-by-region interaction
head(results$marginal$df$stabi_gm)
    ID    prob
15 G15 0.23735
32 G32 0.23725
34 G34 0.23690
16 G16 0.23250
1  G01 0.23210
23 G23 0.23175
results$marginal$plots$stabi_gm

Figure 6: Marginal probability of superior stability across regions

3.5.1.5 Pairwise probability of superior stability

We can also compare the stability of two selection candidates, or a experimental genotype and a commercial check for example. The following equation is used:

\[ Pr[var(\widehat{ge_{jk}}) < var(\widehat{ge}_{ik}) \vert y] = \frac{1}{S} \sum_{s=1}^S{I [var(\widehat{ge}_{jk}^{(s)}) < var(\widehat{ge}_{ik}^{(s)}) \vert y]} \]

Note that, in this case, we are interested in the genotype that has a lower variance of the genotype-by-environment (or genotype-by-region) interaction effects.

  • Genotype-by-environment interaction
head(results$marginal$df$pair_stabi_gl)
    x    prob   y
2 G01 0.55415 G02
3 G01 0.48535 G03
4 G01 0.52830 G04
5 G01 0.62005 G05
6 G01 0.51410 G06
7 G01 0.44720 G07
results$marginal$plots$pair_stabi_gl

Figure 7: Marginal pairwise probability of superior stability across environments (locations)

The results indicates the probability of genotypes at the x-axis being more stable than the genotype at the y-axis.

  • Genotype-by-region interaction
head(results$marginal$df$pair_stabi_gm)
    x    prob   y
2 G01 0.59370 G02
3 G01 0.51150 G03
4 G01 0.52400 G04
5 G01 0.56620 G05
6 G01 0.54615 G06
7 G01 0.50260 G07
results$marginal$plots$pair_stabi_gm

Figure 8: Marginal probability of superior stability across regions

The results indicates the probability of genotypes at the x-axis being more stable than the genotype at the y-axis.

3.5.1.6 Joint probability of superior performance and superior stability

Assuming that the genotypic main effects are independent of the variance of the genotype-by-environment interaction effects, the joint probability of performance and superior stability follows the same ideia as the probability of occurrence of two independent events:

\[ Pr[\hat{g}_j \in \Omega \cap var(\widehat{ge}_{jk}) \in \Omega] = Pr(\hat{g}_j \in \Omega) \times Pr[var(\widehat{ge}_{jk}) \in \Omega] \]

The results are accessed using the following commands:

head(results$marginal$df$joint_prob)
     ID    level category         prob
157 G01 Location    Joint 0.1102474950
158 G02 Location    Joint 0.0010212300
159 G03 Location    Joint 0.0446639700
160 G04 Location    Joint 0.0028105000
161 G05 Location    Joint 0.0001216475
162 G06 Location    Joint 0.0447423700
results$marginal$plots$joint_prob

Figure 9: Joint probability of superior performance and stability

3.5.2 Conditional output

The probabilities of superior performance can be investigated within individual environments. In other words, if you want to recommend an experimental genotype for a specific location. The conditional list is divided into two sublists: the first, df, contains the results compiled into data frames; and the second, plots, have the corresponding ggplots.

3.5.2.1 Probability of superior performance

It is exactly the same idea as previously stated, but instead of using the marginal genotypic effect of each genotype, we use the conditional genotypic effect (\(g_{jk} = g_j + ge_{jk}\)):

\[ Pr(\hat{g}_{jk} \in \Omega_k \vert y) = \frac{1}{S} \sum_{s=1}^S I(\hat{g}_{jk}^{(s)} \in \Omega_k \vert y) \]

The results are accessed using the following commands:

head(results$conditional$df$prob_env)
  env reg gen    prob
1 E01  R1 G01 0.17765
2 E01  R1 G02 0.01025
3 E01  R1 G03 0.09220
4 E01  R1 G04 0.01265
5 E01  R1 G05 0.00360
6 E01  R1 G06 0.19790
results$conditional$plots$prob_env

Figure 10: Probability of superior performance within enviroments (locations)

The grey cells are environments in which the genotype specified in the row was not evaluated.

results$conditional$plots$prob_reg

Figure 11: Probability of superior performance within regions

3.5.2.2 Pairwise probability of superior performance

A directly comparison between genotypes evaluated at the same environment. prob_sup() uses the following equations:

  • When increase = TRUE

\[ Pr(\hat{g}_{jk} > \hat{g}_{j^\prime k} \vert y) = \frac{1}{S} \sum_{s=1}^S I(\hat{g}_{jk}^{(s)} > \hat{g}_{j^\prime k}^{(s)} \vert y) \]

  • When increase = FALSE

\[ Pr(\hat{g}_{jk} < \hat{g}_{j^\prime k} \vert y) = \frac{1}{S} \sum_{s=1}^S I(\hat{g}_{jk}^{(s)} < \hat{g}_{j^\prime k}^{(s)} \vert y) \]

The pairwise comparisons and the ggplots are stored in lists. Therefore, the results are accessed using the following commands, taking the environment “E05” and the region “R2” as examples:

head(results$conditional$df$pwprob_env$E05)
    y   x  pwprob
1 G02 G01 0.95790
2 G03 G01 0.34910
3 G03 G02 0.03285
4 G04 G01 0.95620
5 G04 G03 0.96010
6 G04 G02 0.51755
results$conditional$plots$pwprob_env$E05

Figure 12: Parwise probabilities of superior performance within the environment (location) “E05”
head(results$conditional$df$pwprob_reg$R2)
    y   x     pwprob
1 G02 G01 0.96788333
2 G03 G01 0.60066667
3 G03 G02 0.09798333
4 G04 G01 0.95316667
5 G04 G03 0.86826667
6 G04 G02 0.36655000
results$conditional$plots$pwprob_reg$R2

Figure 13: Pairwise probabilities of superior performance within the region “R2”

Like the other heatmaps, we are evaluating the probability of genotypes at x-axis being superior than genotypes at the y-axis.

4 Wrap up

The estimated probabilities demonstrated in this tutorial from ProbBreed are related to some key question that constantly arises in plant breeding:

  • What is the risk of recommending a selection candidate?

  • What is the probability of a given selection candidate having good performance across or within environments?

  • What is the probability of a selection candidate having better performance than a check cultivar check?

  • How probable is it that a given selection candidate performs similarly across environments?

  • What are the chances that a given selection candidate is more stable than a cultivar check?

  • What is the probability that a given selection candidate having a superior and invariable performance across environments?

We hope this package is useful for you to answer some of these questions. Feel free to contact us if there any issues or suggestions for improvement.


5 References

Dias, K. O. G., Santos, J. P. R., Krause, M. D., Piepho, H.-P., Guimarães, L. J. M., Pastina, M. M., & Garcia, A. A. F. (2022). Leveraging probability concepts for cultivar recommendation in multi-environment trials. Theoretical and Applied Genetics, 135(4), 1385–1399. https://doi.org/10.1007/s00122-022-04041-y
Krause, M. D., Dias, K. O. G., Singh, A. K., & Beavis, W. D. (2023). Using soybean historical field trial data to study genotype by environment variation and identify mega-environments with the integration of genetic and non-genetic factors. bioRxiv : The Preprint Server for Biology. https://doi.org/10.1101/2022.04.11.487885
Sievert, C. (2020). Interactive web-based data visualization with r, plotly, and shiny. Chapman; Hall/CRC. https://plotly-r.com
Stan Development Team. (2023a). RStan: The R interface to Stan. https://mc-stan.org/
Stan Development Team. (2023b). Stan modeling language users guide and reference manual. https://mc-stan.org/
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org