PPC Calibration plots#352
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #352 +/- ##
==========================================
- Coverage 99.18% 99.03% -0.15%
==========================================
Files 35 36 +1
Lines 6132 6530 +398
==========================================
+ Hits 6082 6467 +385
- Misses 50 63 +13 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ExamplesThese should allow for some tests of these functions. Creating example datalibrary(bayesplot)
ymin <- range(example_y_data(), example_yrep_draws())[1]
ymax <- range(example_y_data(), example_yrep_draws())[2]
# Observations and posterior predictive probabilitites.
y <- rbinom(length(example_y_data()), 1, (example_y_data() - ymin) / (ymax - ymin))
prep <- (example_yrep_draws() - ymin) / (ymax - ymin)
groups <- example_group_data()PAVA Calibration overlayBasic ppc_calibration_overlay(y, prep[1:50,])Grouped ppc_calibration_overlay_grouped(y, prep[1:50,], groups)PAVA CalibrationThis isn't yet quite what we want. Now the interval is not what we show in the paper. There, we use consistency intervals, that is, intervals centered at the diagonal displaying, where the calibration curve should lie, i.e. the posterior mean should stay within these bounds. ppc_calibration(y, prep)ppc_calibration_grouped(y, prep, groups) |
jgabry
left a comment
There was a problem hiding this comment.
This all sounds good, thanks @TeemuSailynoja. I made a few small review comments/questions. In addition to those questions, when you say
This isn't yet quite what we want. Now the interval is not what we show in the paper.
you mean that we will want to change this to use the consistency intervals you use in the paper, right? Do you think it's at all useful to give the user the option to choose which kind of interval? Or just strictly better to use the consistency intervals? I hadn't really thought about that.
| if (requireNamespace("monotone", quietly = TRUE)) { | ||
| monotone <- monotone::monotone | ||
| } else { | ||
| monotone <- function(y) { | ||
| stats::isoreg(y)$yf | ||
| } | ||
| } |
There was a problem hiding this comment.
Is there an advantage to using monotone::monotone instead of stats::isoreg?
There was a problem hiding this comment.
That is, does it do something slightly better? Or the same thing more efficiently? I've seen stats::isoreg before but I had never seen the monotone package. If there's no difference then it's probably not worth checking for the monotone package. If it's better then we could put monotone in Suggests and then check for it like you do here.
There was a problem hiding this comment.
monotone offers an implementation of the algorithm that is noticeably faster for large samples.
I think it would be good to add it to the suggests.
| #' @rdname PPC-calibration | ||
| #' @export | ||
| ppc_calibration_overlay <- function( | ||
| y, prep, ..., linewidth = 0.25, alpha = 0.5) { |
There was a problem hiding this comment.
So for these functions prep is a matrix of probabilities and not actually a matrix of draws of binary outcomes from the posterior predictive distribution, right? I think in that case the argument name prep makes sense. But the description at the top of the file says
Assess the calibration of the predictive distributions
yrepin relation to the data `y'
which makes it sound like the user should give us yrep. So I think we just need to reconcile how we describe this to the user.
There was a problem hiding this comment.
Your interpretation is right. The description needs more clarity. yrep can be made to be accepted by ppc_calibration (without overlay).
Leaving an option to choose is perhaps the best, as long as the difference is explained in the documentation. Confidence = "Where do we think the calibration curve for our model lies." |
|
Ok great, thanks for the replies. Sounds good to me. |
|
I've now updated the plotting functions ( Please take a look at the changes whenever you have a moment @avehtari, @TeemuSailynoja. I'm happy to receive any feedback and improve things further based on your thoughts. Thank you! |
|
The first comment says
and then
It would be nice to list the loo versions in the list of implemented functions. We need also better documentation for what are the expected arguments and how the plots are computed. With where posterior_predict gives posterior draws of counts and not probabilities, and then E_loo compute LOO-expectation so that we get single probability for each observation and In addition, for |
done |
See corresponding vignette: vignettes/ppc-calibration.Rmd |
|
@jgabry @avehtari This repo is ready for final review. It would be nice if you can have a look. Probably you can pay specific attention to how I incorporated this vignette at the moment as I was not sure how to do this correctly. Thank you! |
jgabry
left a comment
There was a problem hiding this comment.
Here's a first round of review comments, sorry for the delay!
| ord = order(.data$value), | ||
| y_id = .data$ord, | ||
| value = .data$value[.data$ord], | ||
| cep = monotone(y[.data$ord]) |
There was a problem hiding this comment.
I think here:
- ord = order(.data$value) is position-within-group
- cep = monotone(y[.data$ord]) indexes the full y
For any group after the first, does this compute CEPs using the wrong observations? Do we need to preserve the original y_id from .ppd_data() and index y with the ordered original IDs?
| help_text = TRUE, | ||
| B = 200, | ||
| show_mean = TRUE, | ||
| show_qdots = TRUE, |
There was a problem hiding this comment.
If show_qdots = TRUE by default then the ggdist package is required for the default plot. But we only list ggdist in Suggests. So maybe we should either move ggdist to Imports or make the default show_qdots = FALSE. If we do move ggdist to Imports than any other functions in the package that use it can remove code that checks for it (code like suggested_package("ggdist"), which we have later in this file and in other files). What do you think?
There was a problem hiding this comment.
Actually now that I think about it, maybe we already have this problem also for other function in the package that use ggdist. Since we're now using it in multiple places in the package we could just Import it and remove all code for checking if it's installed. What do you think?
| .loo_resampling_probs <- function(w) { | ||
| if (!all(is.finite(w))) { | ||
| abort("All values in 'lw' must be finite.") | ||
| } | ||
| p <- if (any(w < 0)) { | ||
| # Treat negative entries as log-weights and stabilize before exponentiating. | ||
| exp(w - max(w)) | ||
| } else { | ||
| w | ||
| } | ||
| total <- sum(p) | ||
| if (!is.finite(total) || total <= 0) { | ||
| rep(1 / length(w), length(w)) | ||
| } else { | ||
| p / total | ||
| } | ||
| } |
There was a problem hiding this comment.
If I understand correctly, this function treats weights as log weights only if any entry is negative. But lw is documented as log weights, and I think valid unnormalized log weights can all be positive, right? Are we always assuming already normalized log weights?
| if (any(y < 0 | y > 1)) { | ||
| abort("'y' must contain values in [0, 1] for calibration.") | ||
| } |
There was a problem hiding this comment.
This would still allow fractional y not just binary y. Do we want to check if all y are equal to 0 or 1 or are we allowing e.g. y = 0.1?
| if (any(yrep < 0 | yrep > 1)) { | ||
| abort("Values of 'yrep' should be binary outcomes in [0, 1].") | ||
| } |
There was a problem hiding this comment.
Same thing as with y above, this allows for y_rep between 0 and 1, not just equal to 0 and 1.
| expect_gte(min(p$data$lb), 0) | ||
| }) | ||
|
|
||
| test_that("ppc_calibration recovers identity trend for calibrated data", { |
There was a problem hiding this comment.
This test has diffr::expect_doppelganger so it needs the same skip logic as the other visual tests:
testthat::skip_on_cran()
testthat::skip_if_not_installed("vdiffr")
skip_on_r_oldrel()
skipping on oldrel is used because sometimes in the past SVGs are slightly different (not in a way that the human eye can detect) and cause failures on GitHub Actions.
| #' @rdname PPC-calibration | ||
| #' @export | ||
| ppc_calibration_overlay <- function( | ||
| y, prep, ..., prob = NULL, linewidth = 0.25, alpha = 0.2) { |
There was a problem hiding this comment.
It seems like prob is ignored, right? I think we should either remove it or document that it is accepted only for API consistency but has no effect. (Unless I'm wrong and it's actually serving a purpose here)
| #' @rdname PPC-calibration | ||
| #' @export | ||
| ppc_calibration_overlay_grouped <- function( | ||
| y, prep, group, ..., prob = NULL, linewidth = 0.25, alpha = 0.2) { |




This is my work in progress of the pava calibration plots discussed in #343
Currently implemented:
ppc_calibration_overlay()ppc_calibration_overlay_grouped()ppc_calibration()ppc_calibration_grouped()ppc_loo_calibration()ppc_loo_calibration_grouped()ppc_calibration_data()Needs:
ppc_calibration().ppc_calibration_data()be exposed to users?