PPC Calibration plots by TeemuSailynoja · Pull Request #352 · stan-dev/bayesplot

TeemuSailynoja · 2025-05-19T15:23:25Z

This is my work in progress on the PAVA calibration plots discussed in #343.

Currently implemented:

ppc_calibration_overlay()
ppc_calibration_overlay_grouped()
ppc_calibration()
ppc_calibration_grouped()
ppc_loo_calibration()
ppc_loo_calibration_grouped()
ppc_calibration_data()

Needs:

Fast example to test functions.
Fix intervals in ppc_calibration()
- update (2026-05-05): currently using only pointwise CIs @florence-bockting
Also example use in documentation
LOO versions
Should .ppc_calibration_data() be exposed to users?
- update (2026-04-15): user-facing function after refactoring by @florence-bockting
tests
check that the input parameter names and default values make sense and are intuitive
Add documentation and comments to the code also.
- update (2026-05-04): included a vignette that documents ppc calibration plots @florence-bockting
- update (2026-05-06): updated function documentation @florence-bockting

codecov-commenter · 2025-05-19T15:30:05Z

Codecov Report

❌ Patch coverage is 96.73367% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.03%. Comparing base (ce9867c) to head (21e4d63).

Files with missing lines	Patch %	Lines
R/ppc-calibration.R	97.18%	11 Missing ⚠️
R/bayesplot-ggplot-themes.R	71.42%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #352      +/-   ##
==========================================
- Coverage   99.18%   99.03%   -0.15%     
==========================================
  Files          35       36       +1     
  Lines        6132     6530     +398     
==========================================
+ Hits         6082     6467     +385     
- Misses         50       63      +13

TeemuSailynoja · 2025-05-22T12:08:11Z

Examples

These should allow for some tests of these functions.

Creating example data

library(bayesplot)
# Compute the joint range once and reuse it for rescaling to [0, 1].
y_range <- range(example_y_data(), example_yrep_draws())
ymin <- y_range[1]
ymax <- y_range[2]
# Observations and posterior predictive probabilities.
y <- rbinom(length(example_y_data()), 1, (example_y_data() - ymin) / (ymax - ymin))
prep <- (example_yrep_draws() - ymin) / (ymax - ymin)
groups <- example_group_data()

PAVA Calibration overlay

Basic

ppc_calibration_overlay(y, prep[1:50,])

Grouped

ppc_calibration_overlay_grouped(y, prep[1:50,], groups)

PAVA Calibration

This isn't yet quite what we want. Now the interval is not what we show in the paper. There, we use consistency intervals, that is, intervals centered on the diagonal that show where the calibration curve should lie, i.e. the posterior mean should stay within these bounds.
In this implementation, I'm plotting a confidence interval, which shows where we think the curve lies, i.e. the diagonal should be included.

ppc_calibration(y, prep)

ppc_calibration_grouped(y, prep, groups)

jgabry

This all sounds good, thanks @TeemuSailynoja. I made a few small review comments/questions. In addition to those questions, when you say

This isn't yet quite what we want. Now the interval is not what we show in the paper.

you mean that we will want to change this to use the consistency intervals you use in the paper, right? Do you think it's at all useful to give the user the option to choose which kind of interval? Or just strictly better to use the consistency intervals? I hadn't really thought about that.

jgabry · 2025-05-22T22:18:51Z

+  if (requireNamespace("monotone", quietly = TRUE)) {
+    monotone <- monotone::monotone
+  } else {
+    monotone <- function(y) {
+      stats::isoreg(y)$yf
+    }
+  }


Is there an advantage to using monotone::monotone instead of stats::isoreg?

That is, does it do something slightly better? Or the same thing more efficiently? I've seen stats::isoreg before but I had never seen the monotone package. If there's no difference then it's probably not worth checking for the monotone package. If it's better then we could put monotone in Suggests and then check for it like you do here.

monotone offers an implementation of the algorithm that is noticeably faster for large samples.
I think it would be good to add it to the suggests.

Ok sounds good
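For reference, the base-R fallback discussed above can be sketched as follows: a PAVA calibration curve computed with stats::isoreg only. The helper name pava_cep and the simulated data are illustrative assumptions, not code from this PR.

```r
# Hypothetical helper: conditional event probabilities (CEP) via PAVA.
pava_cep <- function(y, p) {
  ord <- order(p)
  # isoreg() with a single argument fits against x = 1, 2, ..., n,
  # so y has to be sorted by the predicted probabilities first.
  data.frame(p = p[ord], cep = stats::isoreg(y[ord])$yf)
}

set.seed(1)
p <- runif(200)          # predicted probabilities
y <- rbinom(200, 1, p)   # binary outcomes, calibrated by construction
cal <- pava_cep(y, p)
head(cal)
```

If monotone is installed, monotone::monotone could be swapped in for the isoreg() call as in the diff above; the sorting step stays the same.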

jgabry · 2025-05-22T22:24:30Z

+#' @rdname PPC-calibration
+#' @export
+ppc_calibration_overlay <- function(
+    y, prep, ..., linewidth = 0.25, alpha = 0.5) {


So for these functions prep is a matrix of probabilities and not actually a matrix of draws of binary outcomes from the posterior predictive distribution, right? I think in that case the argument name prep makes sense. But the description at the top of the file says

Assess the calibration of the predictive distributions yrep in relation to the data `y'

which makes it sound like the user should give us yrep. So I think we just need to reconcile how we describe this to the user.

Your interpretation is right. The description needs more clarity. yrep can be made to be accepted by ppc_calibration (without overlay).
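To make the prep / yrep distinction concrete, here is a minimal sketch of how a matrix of binary posterior predictive draws could be reduced to one predicted probability per observation. The simulated yrep and the name p_hat are assumptions for illustration; how ppc_calibration() would actually handle yrep is not shown in this thread.

```r
set.seed(2)
n_draws <- 400
p_true <- c(0.1, 0.3, 0.5, 0.7, 0.9)
n_obs <- length(p_true)

# yrep: draws x observations matrix of binary outcomes.
# matrix() fills column by column, so column i is simulated with p_true[i].
yrep <- matrix(
  rbinom(n_draws * n_obs, 1, rep(p_true, each = n_draws)),
  nrow = n_draws
)

# One event probability per observation: the share of draws equal to 1.
p_hat <- colMeans(yrep)
round(p_hat, 2)
```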

TeemuSailynoja · 2025-06-03T12:03:36Z

This all sounds good, thanks @TeemuSailynoja. I made a few small review comments/questions. In addition to those questions, when you say

This isn't yet quite what we want. Now the interval is not what we show in the paper.

you mean that we will want to change this to use the consistency intervals you use in the paper, right? Do you think it's at all useful to give the user the option to choose which kind of interval? Or just strictly better to use the consistency intervals? I hadn't really thought about that.

Leaving an option to choose is perhaps the best, as long as the difference is explained in the documentation.

Confidence = "Where do we think the calibration curve for our model lies."
Consistency = "Where should the curve of a consistent model lie."
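The two definitions above can be illustrated with a rough base-R sketch. It is not necessarily how the PR computes either band: the consistency band is simulated by drawing outcomes under perfect calibration, the confidence band by a plain bootstrap of the (p, y) pairs, and all names (pava, cons_band, conf_band) are assumptions.

```r
# PAVA fit of y against p; assumes p is already sorted.
pava <- function(y) stats::isoreg(y)$yf

set.seed(3)
n <- 300
B <- 200
p <- sort(runif(n))      # predicted probabilities (sorted)
y <- rbinom(n, 1, p)     # observed binary outcomes

# Consistency: where would the curve of a calibrated model lie?
# Simulate y* ~ Bernoulli(p) and refit; the band surrounds the diagonal.
cons <- replicate(B, pava(rbinom(n, 1, p)))
cons_band <- apply(cons, 1, quantile, probs = c(0.05, 0.95))

# Confidence: where do we think the curve of our model lies?
# Bootstrap the observed pairs, refit, and evaluate on the common grid p.
conf <- replicate(B, {
  idx <- sample.int(n, replace = TRUE)
  ord <- order(p[idx])
  fit <- stats::isoreg(y[idx][ord])$yf
  approx(p[idx][ord], fit, xout = p, rule = 2, ties = mean)$y
})
conf_band <- apply(conf, 1, quantile, probs = c(0.05, 0.95))
```

With the consistency band one checks whether the fitted curve stays inside the band; with the confidence band one checks whether the diagonal does.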

jgabry · 2025-06-03T16:24:45Z

Ok great, thanks for the replies. Sounds good to me.

florence-bockting · 2026-04-09T12:27:56Z

I've now updated the plotting functions (ppc_calibration*).
Unfortunately, I wasn't able to use the most recent changes from this PR; the code was failing because of several undefined arguments in the functions. So I went back one commit in the history and rebuilt the "confidence" vs. "consistency" version from scratch. At the same time, I created a short vignette that explains the logic behind it and shows the current visualizations together with their customization options (for convenience, a rendered HTML version is attached).
The simulated data used in the vignette isn't ideal yet (I'm still building some intuition for how best to construct the dataset), so I would appreciate any input or suggestions you might have.

Please take a look at the changes whenever you have a moment @avehtari, @TeemuSailynoja. I'm happy to receive any feedback and improve things further based on your thoughts. Thank you!

ppc-calibration.html

avehtari · 2026-04-14T17:52:00Z

The first comment says

Currently implemented:
ppc_calibration_overlay()
ppc_calibration_overlay_grouped()
ppc_calibration()
ppc_calibration_grouped()
.ppc_calibration_data() - internal function

and then

Needs:
...

LOO versions

It would be nice to list the loo versions in the list of implemented functions.

We need also better documentation for what are the expected arguments and how the plots are computed. With reliabilitydiag() I had used

rd <- reliabilitydiag(
  EMOS = pmin(E_loo(0 + (posterior_predict(fit_betabinomial2b) > th), 
                    loo(fit_betabinomial2b)$psis_object)$value, 1),
  y = as.numeric(cu_df_b$cu > th)
)

where posterior_predict gives posterior draws of counts rather than probabilities, and E_loo then computes the LOO expectation so that we get a single probability for each observation; reliabilitydiag then uses a bootstrap idea for the variability. In contrast, ppc_calibration_overlay() / ppc_calibration() now assume that we directly have probabilities (which we can soon get easily from brms).

In addition, for ppc_loo_calibration() it would be useful to mention that resampling is used, since this differs from using E_loo().
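The difference between the two approaches can be sketched in base R. This is only an illustration under assumed inputs (random stand-in log weights, made-up names), not the implementation of ppc_loo_calibration() or loo::E_loo().

```r
set.seed(4)
S <- 1000   # posterior draws
N <- 3      # observations
B <- 200    # resampled draws per observation

prep <- matrix(runif(S * N), nrow = S)   # draws x observations probabilities
lw <- matrix(rnorm(S * N), nrow = S)     # stand-in for log importance weights

# Normalize each column of log weights to weights that sum to 1.
w <- apply(lw, 2, function(l) {
  e <- exp(l - max(l))
  e / sum(e)
})

# (a) E_loo()-style: one importance-weighted expectation per observation.
p_eloo <- colSums(w * prep)

# (b) Resampling: draw indices in proportion to the weights, which keeps
# a set of (approximately LOO) draws per observation instead of one number.
p_resampled <- sapply(seq_len(N), function(i) {
  prep[sample.int(S, B, replace = TRUE, prob = w[, i]), i]
})
```

Approach (a) returns a vector of length N, while (b) returns a B x N matrix whose column means approximate the values from (a).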

florence-bockting · 2026-04-15T10:39:56Z

It would be nice to list the loo versions in the list of implemented functions.

done

florence-bockting · 2026-04-21T07:36:30Z

We need also better documentation for what are the expected arguments and how the plots are computed.

See corresponding vignette: vignettes/ppc-calibration.Rmd

florence-bockting · 2026-05-08T10:47:45Z

@jgabry @avehtari This PR is ready for final review. It would be nice if you could have a look.
I also created a vignette for the ppc-calibration functions. This vignette should be "online only" because it introduces brms as a dependency, which we probably don't want to add to bayesplot "just" because of the vignette.

Could you pay particular attention to how I incorporated this vignette? I was not sure how to do this correctly.

Thank you!

jgabry

Here's a first round of review comments, sorry for the delay!

jgabry · 2026-05-28T10:03:21Z

+        ord = order(.data$value),
+        y_id = .data$ord,
+        value = .data$value[.data$ord],
+        cep = monotone(y[.data$ord])


I think here:

ord = order(.data$value) is position-within-group

cep = monotone(y[.data$ord]) indexes the full y

For any group after the first, does this compute CEPs using the wrong observations? Do we need to preserve the original y_id from .ppd_data() and index y with the ordered original IDs?
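A toy example of the concern, with made-up data: ordering within a group yields positions 1..n_group, which must be mapped back to the original observation IDs before indexing y. The variable names are assumptions; this is not the code from the PR.

```r
y     <- c(0, 0, 1, 1, 1, 0)
value <- c(0.2, 0.1, 0.9, 0.8, 0.7, 0.3)
group <- c("a", "a", "a", "b", "b", "b")
y_id  <- seq_along(y)

# Index y with the original IDs, sorted by predicted value within each group.
cep_by_group <- lapply(split(y_id, group), function(ids) {
  ord <- ids[order(value[ids])]
  data.frame(y_id = ord, value = value[ord], cep = stats::isoreg(y[ord])$yf)
})

# Naive version for group "b": the within-group positions 3, 2, 1 index the
# full y, so the outcomes of observations 3, 2, 1 (group "a") are used.
ids_b <- y_id[group == "b"]
cep_naive_b <- stats::isoreg(y[order(value[ids_b])])$yf

cep_by_group$b$cep   # uses observations 6, 5, 4
cep_naive_b          # uses observations 3, 2, 1 instead
```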

jgabry · 2026-05-28T10:07:27Z

+    help_text = TRUE,
+    B = 200,
+    show_mean = TRUE,
+    show_qdots = TRUE,


If show_qdots = TRUE by default then the ggdist package is required for the default plot. But we only list ggdist in Suggests. So maybe we should either move ggdist to Imports or make the default show_qdots = FALSE. If we do move ggdist to Imports then any other functions in the package that use it can remove the code that checks for it (code like suggested_package("ggdist"), which we have later in this file and in other files). What do you think?

Actually, now that I think about it, maybe we already have this problem for other functions in the package that use ggdist. Since we're now using it in multiple places in the package, we could just import it and remove all the code that checks whether it's installed. What do you think?

jgabry · 2026-05-28T10:10:08Z

+.loo_resampling_probs <- function(w) {
+  if (!all(is.finite(w))) {
+    abort("All values in 'lw' must be finite.")
+  }
+  p <- if (any(w < 0)) {
+    # Treat negative entries as log-weights and stabilize before exponentiating.
+    exp(w - max(w))
+  } else {
+    w
+  }
+  total <- sum(p)
+  if (!is.finite(total) || total <= 0) {
+    rep(1 / length(w), length(w))
+  } else {
+    p / total
+  }
+}


If I understand correctly, this function treats weights as log weights only if any entry is negative. But lw is documented as log weights, and I think valid unnormalized log weights can all be positive, right? Are we always assuming already normalized log weights?
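One possible alternative, sketched under the assumption that the input is always on the log scale: normalize with the usual max-subtraction trick regardless of sign. The function name is hypothetical and this is not the code in the PR.

```r
# Hypothetical alternative: always treat the input as log weights.
normalize_log_weights <- function(lw) {
  stopifnot(all(is.finite(lw)))
  # Subtracting max(lw) avoids overflow in exp() and cancels out
  # after dividing by the sum.
  w <- exp(lw - max(lw))
  w / sum(w)
}

# All-positive log weights are handled on the log scale as well.
normalize_log_weights(c(1, 2, 3))

# The sign-based heuristic would instead treat c(1, 2, 3) as plain weights:
c(1, 2, 3) / sum(c(1, 2, 3))
```

The two results differ, which is the ambiguity raised in the comment: for all-positive input, the sign of the entries alone cannot tell log weights from plain weights.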

jgabry · 2026-05-28T10:12:28Z

+  if (any(y < 0 | y > 1)) {
+    abort("'y' must contain values in [0, 1] for calibration.")
+  }


This would still allow fractional y not just binary y. Do we want to check if all y are equal to 0 or 1 or are we allowing e.g. y = 0.1?

jgabry · 2026-05-28T10:13:29Z

+      if (any(yrep < 0 | yrep > 1)) {
+        abort("Values of 'yrep' should be binary outcomes in [0, 1].")
+      }


Same thing as with y above: this allows yrep values between 0 and 1, not just values equal to 0 or 1.
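If strictly binary input is what we want, the range check could be replaced by a membership check. A minimal sketch (the helper name is_binary is an assumption):

```r
# Hypothetical helper: TRUE only if every value is exactly 0 or 1.
# NA values fail the check because NA %in% c(0, 1) is FALSE.
is_binary <- function(x) all(x %in% c(0, 1))

is_binary(c(0, 1, 1))     # TRUE
is_binary(c(0, 0.1, 1))   # FALSE: passes the [0, 1] range check, but is not binary
```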

jgabry · 2026-05-28T10:17:08Z

+  expect_gte(min(p$data$lb), 0)
+})
+
+test_that("ppc_calibration recovers identity trend for calibrated data", {


This test uses vdiffr::expect_doppelganger, so it needs the same skip logic as the other visual tests:

testthat::skip_on_cran()
testthat::skip_if_not_installed("vdiffr")
skip_on_r_oldrel()

We skip on oldrel because in the past the SVGs have sometimes been slightly different (not in a way the human eye can detect), which caused failures on GitHub Actions.

jgabry · 2026-05-28T10:19:16Z

+#' @rdname PPC-calibration
+#' @export
+ppc_calibration_overlay <- function(
+    y, prep, ..., prob = NULL, linewidth = 0.25, alpha = 0.2) {


It seems like prob is ignored, right? I think we should either remove it or document that it is accepted only for API consistency but has no effect. (Unless I'm wrong and it's actually serving a purpose here)

jgabry · 2026-05-28T10:19:36Z

+#' @rdname PPC-calibration
+#' @export
+ppc_calibration_overlay_grouped <- function(
+    y, prep, group, ..., prob = NULL, linewidth = 0.25, alpha = 0.2) {


Same thing about prob here

TeemuSailynoja added 2 commits April 29, 2025 15:49

.ppc_calibration_overlay_data, and ppc_calibration_overlay(_grouped)

0e1e446

draft of ppc_calibration plots

d784405

TeemuSailynoja self-assigned this May 19, 2025

TeemuSailynoja added documentation tests new plot labels May 19, 2025

TeemuSailynoja added 2 commits May 22, 2025 14:59

Add example for ppc_calibration_overlay()

5cf62f9

Fix ppc_calibration_grouped()

01ac826

fix typo preventing building doc

f9806eb

jgabry reviewed May 22, 2025

TeemuSailynoja added 4 commits June 12, 2025 14:39

Merge branch 'master' of github.com:TeemuSailynoja/bayesplot into add_ppc_calibration

4f09a00

Add ppc_calibration plots to namespace and docs.

5efdf47

Merge branch 'master' of github.com:stan-dev/bayesplot into add_ppc_calibration

14eb2dc

Sync process. WIP. ISSUE: ppc_calibratrion loses the posterior mean.

a8b4264

avehtari mentioned this pull request Mar 16, 2026

proof of concept for posterior_pit paul-buerkner/brms#1857

Open

florence-bockting and others added 4 commits April 9, 2026 14:43

Merge branch 'master' into add_ppc_calibration

56d6662

update ppc-calibration with consistency and confidence method

293151a

refactor: merge new changes into existing code

bbd03a3

docs: update ppc-calibration

2d30c46

avehtari marked this pull request as ready for review April 13, 2026 13:37

feature: add psis_object to ppc_loo_calibration; adjust helptext size

82986b4

refactor: remove x_scale arg and adj. help_text in ppc-calibration plots

6c665f3

florence-bockting and others added 13 commits April 21, 2026 17:31

Merge branch 'master' into add_ppc_calibration

245dcfa

refactor: remove interval_type argument and adjust corresponding tests

b0ad5f4

docs: add vignette for ppc-calibration plot

789de2f

docs: update function documentation

1778d02

docs: load dplyr in vignette

6c75a20

tests: update snapshots

7cff4da

docs: update vignette ppc-calibration

6cefc64

docs, refactor: update function documentation and refactor ppc_calibration_data()

f19bde9

fix: change sorting of ppc-calibration-data

6e9a5ce

tests: remove brms dependent test

a3320c2

tests: remove test with dependency

c6199f7

docs: add ppc-calibration vignette as article-online-only

8aa2d2d

docs: minor adjustments in vignette

21e4d63

jgabry requested changes May 28, 2026

5 participants

TeemuSailynoja commented May 19, 2025 •

edited by florence-bockting

Loading

codecov-commenter commented May 19, 2025 •

edited

Loading

jgabry May 22, 2025 •

edited

Loading

TeemuSailynoja Jun 3, 2025 •

edited

Loading