Package ProbBreed

Calculating the risk of cultivar recommendation in multi-environment trials

Author
Affiliation

Saulo F. S. Chaves

Departamento de agronomia, Universidade Federal de Viçosa

1 Introduction

Welcome to ProbBreed! This guide will help you using the package’s functions, from fitting a Bayesian model to computing probabilities. Feel free to contact us if there are any issues via GitHub or e-mail. Further details are found within each function documentation (e.g., ?bayes_met) and in the papers Dias et al. (2022) and Chaves et al. (2024).

library(ProbBreed)

2 Step one - fit the Bayesian model

The first step is to fit a multi-environment Bayesian model using the bayes_met function. This function has a set of predefined models that are used to run Bayesian analyses using rstan. There are some important details that needs to be stressed:

1- Currently, the function has nine options of models (Figure 1).

Figure 1: Options to declare replications and/or blocks (repl), years (year) and regions (reg) effects in the bayes_met function. Users must substitute Repl, Block, Year, and Region by the name of the column that contains the information of replicates, blocks (if applicable), year (if available) and region (if available). RCDB and IBD are acronyms for randomized complete block design and incomplete block design, respectively. A model using only adjusted means can only be fitted with reg = NULL and year = NULL

These models differ according to the considered information: only locations, locations and years, locations and breeding regions, and locations, years and breeding regions. Should you consider an “environment” as a combination of environmental factors (for instance, location and planting dates within location), this information must be in the loc argument. All models will have gen and loc. If you want to consider the effect of regions (or mega-environment) and/or time factors (for instance, years or harvests), then reg and year must represent the column that contains the information on these effects. Otherwise, these arguments are set to NULL. You may control the experimental design effects in the repl argument. If repl = NULL, bayes_met will assume that you are entering with adjusted means of each environment, i.e., data without replicates. If you have data from trials laid out in randomized complete blocks, repl = 'Replicate'. Finally, if the multi-environment trials were laid out in incomplete blocks design, repl = c('Replicate', 'Block'). By default, when repl = NULL, reg and year must also be NULL.

2- You may change the number of iterations, chains, and cores. This will vary according to the data and the processing capacity of your machine. The default is set 2000 iterations (including warm-up), 2 cores and 4 chains. Be aware that the more iterations and chains, the longer the computation time. At the same time, results are likely more reliable, and mixing/convergence issues will diminish. By default, if more than one core is set, the function runs one chain per core. The number of cores depends on the processing capacity of your machine. Choose wisely.

3- Sometimes, bayes_met may yield warnings on possible mixing/convergence issues. More details on these problems and how to deal with them are available here. Usually, increasing the number of iterations is enough. If this does not work, bayes_met has several arguments passed to rstan::sampling() in which advanced users can configure the default sampling parameters (more details on bayes_met documentation: help("bayes_met")), and hopefully, diminish these issues. Nevertheless, before taking any measures - or even disposing of the model - we recommend doing the posterior predictive check using the extr_outs function. Checking the \(\hat{R}\) statistic, Bayesian p-values and Empirical vs Sampled density plot is particularly useful. If bayes_met showed warnings but the posterior predictive checks do not indicate serious issues, you can rely on the results and proceed with the analysis.

4- You can choose between a model with homogeneous or heterogeneous residual variances using res.het = FALSE and res.het = TRUE, respectively. When res.het = TRUE, there will be a variance for each location (or environment). By default, when repl = NULL, res.het must be TRUE.

5- The Bayesian models implemented in ProbBreed have predefined priors described by Dias et al. (2022). In summary: \[ x \sim \mathcal{N}\left(0,S^{[x]}\right) \] \[ \sigma \sim \mathcal{HalfCauchy}\left(0, S^{[\sigma]}\right) \]

where \(x\) can be any effect but the residual, and \(\sigma\) is the standard deviation of the likelihood. If an heterogeneous model is set (res.het = TRUE), then \(\sigma_k \sim HalfCauchy\left(0, S^{[\sigma_k]}\right)\).

The hyperpriors are set as follows: \[ S^{[x]} \sim HalfCauchy(0, \phi) \] where \(\phi\) is the known global hyperparameter defined such as \(\phi = max(y) \times 10\). By doing this, we restrain the hyperparameter to be finite and allow the model to take full advantage of the data to infer the posterior distribution since it provide weakly informative prior distributions, avoiding subjectivity that can possibly hamper the results. The half-Cauchy distribution was used to guarantee that variance components will not be negative.

Let us see how the nine models can be fitted using bayes_met(). The following models are only for educational purposes, and will not provide a valid output. We will use the terms “location”, “region” and “years” to avoid the possible confusion caused by the term “environment”, which can have several meanings.

2.1 Entry-mean model

\[ y_{jk} = \mu + l_k + g_j + \varepsilon_{jk} \]

where the \(y_{jk}\) is the phenotypic record of the \(j^{\text{th}}\) genotype in the \(k^{\text{th}}\) location, \(\mu\) is the intercept, \(l_k\) is the main effects of the \(k^{\text{th}}\) location, \(g_j\) is the main effect of the \(j^{\text{th}}\) genotype, and \(\varepsilon_{jk}\) is the residual effect.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = NULL,
          year = NULL,
          reg = NULL,
          res.het = TRUE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)
Warning

Within-environment probabilities of superior performance, and probabilities of superior stability are unavailable for this type of model, since we have no explicit genotype \(\times\) environment interaction effect in this case.

2.2 Randomized complete blocks design

2.2.1 Only locations

\[ y_{jkp} = \mu + l_k + b_{p(k)} + g_j + gl_{jk} + \varepsilon_{jkp} \]

where the \(y_{jkp}\) is the phenotypic record of the \(j^{\text{th}}\) genotype, allocated in the \(p^{\text{th}}\) block, in the \(k^{\text{th}}\) location. All other effects were previously defined but \(b_{p(k)}\), which is the effect of the \(p^{\text{th}}\) block in the \(k^{\text{th}}\) location, and \(gl_{jk}\), which correspond to the genotype-by-location interaction.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = "Block",
          year = NULL,
          reg = NULL,
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

2.2.2 Locations and regions

\[ y_{jkwp} = \mu + m_w + l_k + b_{p(k)} + g_j + gl_{jk} + gm_{jw} + \varepsilon_{jkwp} \]

All effects were previously defined but \(m_w\) and \(gm_{jw}\), which are the main effects of region and genotypes-by-regions interaction, respectively.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = "Block",
          reg = "Region",
          year = NULL,
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

2.2.3 Locations and years

\[ y_{jkhp} = \mu + t_h + l_k + b_{p(k)} + g_j + gl_{jk} + gt_{jh} + \varepsilon_{jkhp} \]

where \(t_h\) and \(t_{jh}\) are the main effect of years and the genotypes-by-years interaction effect, respectively.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = "Block",
          reg = NULL,
          year = 'Year',
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

2.2.4 Locations, regions and years

\[ y_{jkhwp} = \mu + t_h + m_w + l_k + b_{p(k)} + g_j + gl_{jk} + gt_{jh} + gm_{jw} + \varepsilon_{jkhwp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = "Block",
          reg = "Region",
          year = 'Year',
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

2.3 Incomplete blocks desing

2.3.1 Only locations

\[ y_{jkqp} = \mu + l_k + r_{q(k)} + b_{p(qk)} + g_j + gl_{jk} + \varepsilon_{jkqp} \]

where the \(y_{jkqp}\) is the phenotypic record of the \(j^{\text{th}}\) genotype, allocated in the \(p^{\text{th}}\) block of the \(q^{\text{th}}\) replicate, in the \(k^{\text{th}}\) location. All other effects were previously defined but \(r_{q(k)}\), which is the effect of the \(q^{\text{th}}\) replicate in the \(k^{\text{th}}\) location. Note that the indices of \(b\) also change.

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = c("Replicate", "Block"),
          reg = NULL,
          year = NULL,
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4)

2.3.2 Locations and regions

\[ y_{jkwqp} = \mu + m_w + l_k + r_{q(k)} + b_{p(qk)} + g_j + gl_{jk} + gm_{jw} + \varepsilon_{jkwqp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = c("Replicate", "Block"),
          reg = "Region",
          year = NULL,
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4) 

2.3.3 Locations and years

\[ y_{jkhqp} = \mu + t_h + l_k + r_{q(k)} + b_{p(qk)} + g_j + gl_{jk} + gt_{jh} + \varepsilon_{jkhqp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = c("Replicate", "Block"),
          reg = NULL,
          year = 'Year',
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4) 

2.3.4 Locations, regions and years

\[ y_{jkhwqp} = \mu + t_h + m_w + l_k + r_{q(k)} + b_{p(qk)} + g_j + gl_{jk} + gt_{jh} + gm_{jw} + \varepsilon_{jkhwqp} \]

mod = bayes_met(data = MyData, 
          gen = "Genotype", 
          loc = "Location",
          repl = c("Replicate", "Block"),
          reg = "Region",
          year = 'Year',
          res.het = FALSE,
          trait = "Phenotype",
          iter = 2000, cores = 2, chains = 4) 

2.4 Example dataset

For the next steps, we will use the maize data, which is contained within the package. This dataset was used by Dias et al. (2022) in the paper that proposed the methods. It contains 32 single-cross hybrids and four commercial checks (36 genotypes in total) evaluated in 16 locations across five regions or mega-environments. These trials were laid out in incomplete blocks design, using a block size of 6 and two replications per trial. This dataset was kindly provided by Embrapa Milho e Sorgo.

mod = bayes_met(data = maize, 
                gen = "Hybrid",
                loc = "Location",
                repl = c("Rep","Block"),
                trait = "GY", 
                reg = "Region",
                year = NULL, 
                res.het = TRUE, 
                iter = 4000, 
                cores = 4, 
                chain = 4)

ProbBreed uses rstan to fit Bayesian models, so packages and functions designed to manage and explore these models can be used in the output object of the bayes_met function. Some interesting options are rstantools, shinystan, and bayesplot.

3 Step two - explore the model’s outputs

After fitting the model, the next step is to extract outputs using extr_outs. This function also provides other useful information, like variance components and posterior predictive checks. Using the mod object from the previous step:

outs = extr_outs(model = mod,
                 probs = c(0.05, 0.95), 
                 verbose = TRUE)

where:

  • model is the model fitted using bayes_met

  • probs are the probabilities considered to calculate the quantiles and the highest posterior density (HPD)

This function provides an object of class extr, which is a list with the posterior distribution of each effect, the data generated by the model, the maximum posterior values of each effect, the variances of each effect, and quality parameters of the model (see below).

outs$variances
            effect   var     sd naive.se HPD_0.05 HPD_0.95
1              Rep 0.025  0.034    0.000    0.000    0.091
2            Block 0.219  0.047    0.001    0.148    0.301
3           Hybrid 0.216  0.076    0.001    0.117    0.359
4         Location 7.887  3.845    0.043    3.794   14.791
5  Hybrid:Location 0.361  0.068    0.001    0.254    0.477
6           Region 4.746 20.828    0.233    0.020   16.855
7    Hybrid:Region 0.061  0.040    0.000    0.006    0.136
8       error_env1 0.875  0.196    0.002    0.597    1.223
9       error_env2 0.942  0.257    0.003    0.593    1.420
10      error_env3 1.400  0.318    0.004    0.950    1.979
11      error_env4 0.597  0.152    0.002    0.389    0.877
12      error_env5 1.075  0.232    0.003    0.743    1.490
13      error_env6 1.457  0.326    0.004    0.995    2.071
14      error_env7 0.282  0.072    0.001    0.186    0.414
15      error_env8 2.009  0.479    0.005    1.319    2.879
16      error_env9 0.567  0.155    0.002    0.357    0.858
17     error_env10 0.624  0.146    0.002    0.416    0.889
18     error_env11 1.255  0.284    0.003    0.845    1.773
19     error_env12 0.452  0.116    0.001    0.292    0.664
20     error_env13 0.749  0.189    0.002    0.483    1.096
21     error_env14 1.827  0.403    0.005    1.270    2.553
22     error_env15 0.822  0.198    0.002    0.550    1.184
23     error_env16 1.818  0.407    0.005    1.237    2.546
outs$ppcheck
                  Diagnostics
p.val_max              0.3349
p.val_min              0.3201
p.val_median           0.5388
p.val_mean             0.4904
p.val_sd               0.5374
Eff_No_parameters    183.7621
WAIC2               3924.2884
mean_Rhat              1.0006
Eff_sample_size        0.7963

The S3 method plot will generate some useful plots, like the density plots and histograms built from the posterior effects (Figure 2). It can also build trace plots.

plot(outs, category = "density")
plot(outs, category = "histogram")
(a) Density plots
(b) Histograms
Figure 2: Posterior effects’ distribution as depicted by density plots and histograms

A particularly useful plot is the comparison between the empirical and sampled phenotype, which illustrates the model’s convergence (Figure 3). The more the density plots overlap, the more successful was the model in sampling the effects.

plot(outs)
Figure 3: Density plots comparing the distributions of empirical and sampled phenotypes. The more overlapped, the better.

See more plot options in help('plot.extr').

4 Step three - compute the probabilities

With the outputs extracted by extr_outs, we can calculate the probabilities of superior performance and superior stability of the evaluated genotypes with prob_sup:

results = prob_sup(extr = outs, 
                   int = .2,
                   increase = TRUE, 
                   save.df = FALSE, 
                   verbose = FALSE)

where:

  • increase: The objective is for increasing (TRUE, default) or decreasing (FALSE) the trait mean

  • extr: An extr object that contains the outputs of the extr_outs() function.

  • int: The selection intensity, expressed in decimal values. In the example, the selection intensity is 20%.

  • save.df: TRUE if you want to save the data frames containing the calculated probabilities in the working directory, FALSE otherwise.

This function generates an object of class probsup, which consists of two lists: across and within. The first list contains the probabilities of superior performance and superior stability across environments, while the second contains the probabilities of superior performance within environments. probsup is also compatible with the S3 method plot. See the details below or in help("plot.probsup")

4.1 Across-environments results

The across list has the following outputs:

4.1.1 HPD of the posterior genotypic main effects

head(results$across$g_hpd)
  gen           g      HPD95      HPD97.5         HPD5      HPD7.5
1  G1  0.63126869  1.0259712  1.110047110  0.235731433  0.16498478
2 G10 -0.47653152 -0.0817441 -0.005033713 -0.871636673 -0.95611857
3 G11 -0.07243611  0.3106306  0.379956875 -0.449724066 -0.52916081
4 G12  0.24867322  0.6426858  0.721897025 -0.137741834 -0.20659020
5 G13  0.37807601  0.7693715  0.851806417 -0.009092229 -0.08398443
6 G14 -0.25286212  0.1262817  0.202358345 -0.644663245 -0.72620990
plot(results, category = "hpd")
Figure 4: Caterpillar plot represeting the Highest posterior density (HPD) of the posterior genotypic main effects

4.1.2 Probability of superior performance

Let \(\Omega\) be a subset of the high-performance selected genotypes according to the intensity of selection. A given genotype \(j\) will belong to \(\Omega\) if its genotypic marginal value \(\left(\hat{g}_j\right)\) is high (or low) enough compared to its peers. We can emulate the occurrence of \(S\) trials \(\left(s = 1, 2, \dots, S\right)\) by leveraging discretized samples of Monte Carlo posterior distributions of the fitted Bayesian models. Then, the probability of the \(j^{\text{th}}\) genotype belonging to \(\Omega\) is its ratio of success \(\left(\hat{g}_j \in \Omega\right)\) events over the total number of sampled events \(\left[S = \left(\hat{g}_j \in \Omega\right) + \left(\hat{g}_j \notin \Omega\right)\right]\), as follows:

\[ Pr\left(\hat{g}_j \in \Omega \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_j^{(s)} \in \Omega \vert y\right) \] where \(I\left(\hat{g}_j^{(s)} \in \Omega \vert y\right)\) is an indicator variable that can assume two values: (1) if \(\hat{g}_j \in \Omega\) in the \(s^{\text{th}}\) sample, and (0) otherwise. The results provided by prob_sup can be accessed as follows:

head(results$across$perfo)
    ID     prob
36  G9 0.996250
1   G1 0.920375
22 G29 0.798875
24 G30 0.643500
5  G13 0.606750
35  G8 0.558875
plot(results)
Figure 5: Lollipop plot with the probabilities of superior performance of selection candidates

4.1.3 Pairwise probability of superior performance

To directly compare a selection candidate with a commercial check, or another promising genotype, we can calculate the pairwise probability of superior performance. This metric computes the probability of a genotype \(j\) performing better than a genotype \(j^\prime\):

\[ Pr\left(\hat{g}_{j} > \hat{g}_{j^\prime} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{j}^{(s)} > \hat{g}_{j^\prime}^{(s)} \vert y\right) \] Note that this equation applies if the selection direction is to increase the trait value. If the aim is to decrease it (for example, plant height or susceptibility to a disease), you can set increase = FALSE, and the following equation is employed:

\[ Pr\left(\hat{g}_{j} < \hat{g}_{j^\prime} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{j}^{(s)} < \hat{g}_{j^\prime}^{(s)} \vert y\right) \] The results are accessed using the following commands:

results$across$pair_perfo[1:5, 1:5]
          G1      G10      G11     G12      G13
G1  0.000000 0.000250 0.012625 0.11600 0.210125
G10 0.999750 0.000000 0.892500 0.98975 0.995375
G11 0.987375 0.107500 0.000000 0.84200 0.923875
G12 0.884000 0.010250 0.158000 0.00000 0.662250
G13 0.789875 0.004625 0.076125 0.33775 0.000000
plot(results, category = "pair_perfo")
Figure 6: Heatmap with the pairwise probabilities of superior performance. The heatmap depicts the probability of genotypes in the x-axis being superior to those in the y-axis.

4.1.4 Probability of superior stability

Considering genotypes that have low genotype-by-environment interaction variance \(\left[var\left(ge_{jk}\right)\right]\) as stable, the probability of superior stability is given by the following equation:

\[ Pr\left[var\left(\widehat{ge}_{jk}\right) \in \Omega \vert y\right] = \frac{1}{S} \sum_{s=1}^S I\left[var\left(\widehat{ge}_{jk}^{(s)}\right) \in \Omega \vert y\right] \] where \(I\left[var\left(\widehat{ge}_{jk}^{(s)}\right) \in \Omega \vert y\right]\) indicates if \(var\left(\widehat{ge}_{jk}^{(s)}\right)\) exists in \(\Omega\) (1) or not (0) for the \(j^{th}\) genotype on the \(k^{th}\) environment. Note that this probability can only be computed across environments since it depends on \(var\left(\widehat{ge}_{jk}\right)\). Here are the codes:

head(results$across$stabi$gl)
    ID     prob
8  G16 0.504875
25 G31 0.447750
28 G34 0.410125
4  G12 0.404125
29 G35 0.389875
24 G30 0.371375
head(results$across$stabi$gm)
    ID     prob
28 G34 0.282625
24 G30 0.281250
13 G20 0.276500
32  G5 0.264500
36  G9 0.259000
15 G22 0.257625
plot(results, category = "stabi")
Figure 7: Lollipop plot with the probabilities of superior stability (per location and region) of selection candidates

4.1.5 Pairwise probability of superior stability

We can also compare the stability of two selection candidates, or a experimental genotype and a commercial check for example. The following equation is used:

\[ Pr\left[var\left(\widehat{ge_{jk}}\right) < var\left(\widehat{ge}_{ik}\right) \vert y\right] = \frac{1}{S} \sum_{s=1}^S{I \left[var\left(\widehat{ge}_{jk}^{(s)}\right) < var\left(\widehat{ge}_{ik}^{(s)}\right) \vert y\right]} \]

Note that, in this case, we are interested in the genotype that has a lower variance of the genotype-by-environment (or genotype-by-region) interaction effects.

results$across$pair_stabi$gl[1:5, 1:5]
          G1      G10      G11      G12      G13
G1  0.000000 0.744125 0.768750 0.885750 0.778375
G10 0.255875 0.000000 0.526125 0.709875 0.551500
G11 0.231250 0.473875 0.000000 0.690750 0.523750
G12 0.114250 0.290125 0.309250 0.000000 0.338500
G13 0.221625 0.448500 0.476250 0.661500 0.000000
results$across$pair_stabi$gm[1:5, 1:5]
         G1      G10      G11      G12      G13
G1  0.00000 0.555000 0.578250 0.571750 0.583250
G10 0.44500 0.000000 0.525375 0.525125 0.534625
G11 0.42175 0.474625 0.000000 0.500000 0.509000
G12 0.42825 0.474875 0.500000 0.000000 0.506625
G13 0.41675 0.465375 0.491000 0.493375 0.000000
plot(results, category = "pair_stabi")
Figure 8: Heatmap with the pairwise probabilities of superior stability. The heatmap depicts the probability of genotypes in the x-axis having lower GEI variance than those in the y-axis.

4.1.6 Joint probability of superior performance and superior stability

Assuming that the genotypic main effects are independent of the variance of the genotype-by-environment interaction effects, the joint probability of performance and superior stability follows the same idea as the probability of occurrence of two independent events:

\[ Pr\left[\hat{g}_j \in \Omega \cap var\left(\widehat{ge}_{jk}\right) \in \Omega\right] = Pr\left(\hat{g}_j \in \Omega\right) \times Pr\left[var\left(\widehat{ge}_{jk}\right) \in \Omega\right] \]

The results are accessed using the following commands:

head(results$across$joint)
     ID    level category         prob
145  G1 Location    Joint 0.0280714375
146 G10 Location    Joint 0.0000773125
147 G11 Location    Joint 0.0066858281
148 G12 Location    Joint 0.1502839844
149 G13 Location    Joint 0.1213500000
150 G14 Location    Joint 0.0010070000
plot(results, category = "joint")
Figure 9: Lollipop plot with the joint probability of superior performance and stability of selection candidates.

4.2 Within-environments results

The probabilities of superior performance can be investigated within individual environments. This is useful for specific recommendations. Here are the available results:

4.2.1 Probability of superior performance

It is exactly the same idea as previously stated, but instead of using the marginal genotypic effect of each genotype, we use the within environment genotypic effect \(\left(g_{jk} = g_j + ge_{jk}\right)\):

\[ Pr\left(\hat{g}_{jk} \in \Omega_k \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{jk}^{(s)} \in \Omega_k \vert y\right) \]

The results are accessed using the following commands:

head(results$within$perfo$gl)
  gen       E1      E10      E11      E12      E13      E14      E15      E16
1  G1 0.984125 0.268750 0.376750 0.423875 0.273000 0.952625 0.325500 0.969875
2 G10 0.010375 0.001250 0.001875 0.002750 0.021375 0.069375 0.000125 0.040875
3 G11 0.005125 0.094500 0.097500 0.005000 0.599500 0.123625 0.736125 0.093625
4 G12 0.158000 0.084250 0.045500 0.053250 0.061375 0.304750 0.160500 0.569500
5 G13 0.205500 0.182500 0.130500 0.129000 0.063875 0.592625 0.805375 0.072250
6 G14 0.000000 0.005375 0.111750 0.156750 0.028750 0.067000 0.000625 0.022875
        E2       E3       E4       E5       E6       E7       E8       E9
1 0.898875 0.625875 0.906625 0.530625 0.306750 0.005125 0.852625 0.376125
2 0.096750 0.195250 0.000875 0.042625 0.024875 0.002000 0.000125 0.042250
3 0.019625 0.031250 0.017500 0.106750 0.329625 0.014625 0.043500 0.223250
4 0.418500 0.605375 0.195750 0.376750 0.435250 0.820375 0.164750 0.321500
5 0.231500 0.697125 0.685750 0.613250 0.820375 0.457125 0.113125 0.535500
6 0.003000 0.045875 0.125750 0.115875 0.616250 0.026250 0.038750 0.068500
plot(results, category = "perfo", level = "within")
Figure 10: Heatmap with the within-environment (location or region) probabilities of superior performance of the selection candidates

The grey cells are environments in which the genotype specified in the row was not evaluated.

4.2.2 Pairwise probability of superior performance

A directly comparison between genotypes evaluated at the same environment. We use the following equations:

  • When increase = TRUE

\[ Pr\left(\hat{g}_{jk} > \hat{g}_{j^\prime k} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{jk}^{(s)} > \hat{g}_{j^\prime k}^{(s)} \vert y\right) \]

  • When increase = FALSE

\[ Pr\left(\hat{g}_{jk} < \hat{g}_{j^\prime k} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{jk}^{(s)} < \hat{g}_{j^\prime k}^{(s)} \vert y\right) \] Since the pairwise comparisons are performed per environment, it is not possible to illustrate them all in a single plot. In fact, plot generates a list of plots, with each element corresponding to an environment. It is strongly recommended that you store this list in an object. Otherwise, plot will simply try to plot them all at once. For that reason, plot will double check if you are using an object to store the results. It will print the following question in the console: “Are you using an object to store the outputs of this function? Enter Y/n:”. If you type “n”, it will break and give you the following message “There is more than one plot, so it is not a good idea to plot them all at once. Please, store the output of this function in an object”. If you plot “Y”, it will build the plots.

The pairwise comparisons and the ggplots are stored in lists. Therefore, the results are accessed using the following commands, taking the environment “E13” and the region “TB” as examples:

head(results$within$perfo$gl)
pairsupwithin = plot(results, category = "pair_perfo", level = "within")
pairsupwithin$gl$E13
pairsupwithin$gm$TB
  gen       E1      E10      E11      E12      E13      E14      E15      E16
1  G1 0.984125 0.268750 0.376750 0.423875 0.273000 0.952625 0.325500 0.969875
2 G10 0.010375 0.001250 0.001875 0.002750 0.021375 0.069375 0.000125 0.040875
3 G11 0.005125 0.094500 0.097500 0.005000 0.599500 0.123625 0.736125 0.093625
4 G12 0.158000 0.084250 0.045500 0.053250 0.061375 0.304750 0.160500 0.569500
5 G13 0.205500 0.182500 0.130500 0.129000 0.063875 0.592625 0.805375 0.072250
6 G14 0.000000 0.005375 0.111750 0.156750 0.028750 0.067000 0.000625 0.022875
        E2       E3       E4       E5       E6       E7       E8       E9
1 0.898875 0.625875 0.906625 0.530625 0.306750 0.005125 0.852625 0.376125
2 0.096750 0.195250 0.000875 0.042625 0.024875 0.002000 0.000125 0.042250
3 0.019625 0.031250 0.017500 0.106750 0.329625 0.014625 0.043500 0.223250
4 0.418500 0.605375 0.195750 0.376750 0.435250 0.820375 0.164750 0.321500
5 0.231500 0.697125 0.685750 0.613250 0.820375 0.457125 0.113125 0.535500
6 0.003000 0.045875 0.125750 0.115875 0.616250 0.026250 0.038750 0.068500
Are you using an object to store the outputs of this function? Enter Y/n: 
(a) Location E13
(b) Region TB
Figure 11: Heatmap with the within-environment pairwise probabilities of superior performance

Like the other heatmaps, we are evaluating the probability of genotypes at x-axis being superior than genotypes at the y-axis.

5 Wrap up

The estimated probabilities demonstrated in this tutorial from ProbBreed are related to some key question that constantly arises in plant breeding:

  • What is the risk of recommending a selection candidate?

  • What is the probability of a given selection candidate having good performance across or within environments?

  • What is the probability of a selection candidate having better performance than a check cultivar check?

  • How probable is it that a given selection candidate performs similarly across environments?

  • What are the chances that a given selection candidate is more stable than a cultivar check?

  • What is the probability that a given selection candidate having a superior and invariable performance across environments?

We hope this package is useful for you to answer some of these questions. Feel free to contact us if there any issues or suggestions for improvement.


6 References

Chaves, S. F. S., Krause, M. D., Dias, L. A. S., Garcia, A. A. F., & Dias, K. O. G. (2024). ProbBreed: A novel tool for calculating the risk of cultivar recommendation in multienvironment trials. G3 GenesGenomesGenetics, 14(3), jkae013. https://doi.org/10.1093/g3journal/jkae013
Dias, K. O. G., Santos, J. P. R., Krause, M. D., Piepho, H.-P., Guimarães, L. J. M., Pastina, M. M., & Garcia, A. A. F. (2022). Leveraging probability concepts for cultivar recommendation in multi-environment trials. Theoretical and Applied Genetics, 135(4), 1385–1399. https://doi.org/10.1007/s00122-022-04041-y