The strength of Conformal Inference (CI) lies in its Finite-Sample Validity. Unlike asymptotic guarantees that require \(n \to \infty\), CI guarantees hold for any sample size \(n\).
The fundamental assumption underlying Conformal Inference is that the data sequence \(Z_1, \dots, Z_{n+1}\) (where \(Z_i = (X_i, Y_i)\)) is exchangeable.
Definition: A sequence is exchangeable if its joint probability distribution is invariant under permutations. This is a weaker (more general) assumption than being Independent and Identically Distributed (IID): every IID sequence is exchangeable, but the converse does not hold.
We define a score function \(V(z, \mathcal{D})\) that measures how “unusual” a data point \(z\) is relative to a dataset \(\mathcal{D}\). Common score functions include:
For Regression: \[S_i = |y_i - \hat{\mu}(x_i)|\]
For Classification: \[S_i = 1 - \hat{\pi}(x_i)_{y_i}\]
For Conformalized Quantile Regression (CQR): \[S_i = \max(\hat{q}_{low}(x_i) - y_i, \ y_i - \hat{q}_{high}(x_i))\]
In the Split Conformal framework, we calculate scores on a calibration set of size \(n\). To construct a \(1-\alpha\) confidence interval, we define the threshold \(\hat{q}\) as:
\[\hat{q} = \text{The } \frac{\lceil (n+1)(1-\alpha) \rceil}{n} \text{ quantile of } \{S_1, \dots, S_n\}\]
The \(+1\) in the numerator is the critical conformal correction. It mathematically accounts for the fact that the unknown future point \(Z_{n+1}\) is itself a member of the exchangeable sequence.
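As a minimal sketch of this definition in R (the helper name conformal_quantile is our own, not from any package), the threshold can be computed directly from a vector of calibration scores:
# Illustrative helper: conformal threshold from calibration scores
conformal_quantile <- function(scores, alpha) {
  n <- length(scores)
  k <- ceiling((n + 1) * (1 - alpha))
  stopifnot(k <= n)  # needs enough calibration points: (n + 1) * (1 - alpha) <= n
  sort(scores)[k]    # the k-th smallest score; equivalent to quantile(scores, probs = k / n, type = 1)
}
# Example with made-up scores
set.seed(123)
conformal_quantile(abs(rnorm(100)), alpha = 0.10)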
The prediction set for a new point \(x_{n+1}\) is constructed by collecting all possible values of \(y\) that result in a score no greater than the threshold \(\hat{q}\):
\[C(x_{n+1}) = \{ y : V((x_{n+1}, y), \mathcal{D}_{train}) \leq \hat{q} \}\]
The Guarantee: Under the sole assumption of exchangeability, it is mathematically proven that for any distribution and any model:
\[P(Y_{n+1} \in C(x_{n+1})) \geq 1 - \alpha\]
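A Monte Carlo sketch (our own illustration, using a simple polynomial regression on simulated data rather than any particular model) makes the guarantee concrete: repeating the train/calibrate/predict cycle many times, the new point falls inside the interval at least \(1-\alpha\) of the time on average.
# Sketch: Monte Carlo check of the marginal coverage guarantee (illustrative setup)
set.seed(100)
one_run <- function(n_tr = 200, n_cal = 200, alpha = 0.10) {
  x_all <- runif(n_tr + n_cal + 1, 0, 5)
  y_all <- x_all^2 + rnorm(n_tr + n_cal + 1, sd = 1.5)
  tr  <- 1:n_tr
  cal <- (n_tr + 1):(n_tr + n_cal)
  te  <- n_tr + n_cal + 1                               # one new test point
  fit <- lm(y ~ poly(x, 2), data = data.frame(x = x_all[tr], y = y_all[tr]))
  s   <- abs(y_all[cal] - predict(fit, data.frame(x = x_all[cal])))
  q   <- quantile(s, probs = ceiling((n_cal + 1) * (1 - alpha)) / n_cal, type = 1)
  abs(y_all[te] - predict(fit, data.frame(x = x_all[te]))) <= q
}
mean(replicate(2000, one_run()))  # approximately 0.90 or slightly above, up to Monte Carlo error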
The validity of Conformal Inference relies on a simple yet profound observation about the ranks of exchangeable random variables.
Assume we have \(n\) calibration scores \(\{S_1, \dots, S_n\}\) and one future test score \(S_{n+1}\). If the underlying data points are exchangeable, then their corresponding non-conformity scores are also exchangeable.
By symmetry, the future score \(S_{n+1}\) is equally likely to fall into any of the “bins” created by the sorted calibration scores. With \(n\) calibration points, there are \(n+1\) possible slots where \(S_{n+1}\) could land.
Mathematically, for any \(j \in \{1, \dots, n+1\}\): \[P\left( \text{Rank}(S_{n+1}) = j \right) = \frac{1}{n+1}\]
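A short simulation (our own; IID normal scores, which are in particular exchangeable) confirms this uniform rank distribution:
# Sketch: with exchangeable scores, Rank(S_{n+1}) is uniform on 1, ..., n + 1
set.seed(7)
n_cal_demo <- 9
ranks <- replicate(20000, {
  s <- rnorm(n_cal_demo + 1)          # n + 1 exchangeable (here IID) scores
  rank(s)[n_cal_demo + 1]             # rank of the held-out "test" score
})
round(prop.table(table(ranks)), 3)    # each rank should appear with probability about 1 / (n + 1) = 0.1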
We want to ensure that \(S_{n+1}\) is not “too large.” Specifically, we want to find a threshold \(\hat{q}\) such that \(S_{n+1} \leq \hat{q}\) with probability at least \(1-\alpha\).
If we sort our calibration scores as \(S_{(1)} \leq S_{(2)} \leq \dots \leq S_{(n)}\), then \(S_{n+1}\) is less than or equal to the \(k\)-th smallest calibration score \(S_{(k)}\) if its rank is among the first \(k\) positions:
\[P(S_{n+1} \leq S_{(k)}) \geq \frac{k}{n+1}\] (with equality when the scores are almost surely distinct; ties only increase this probability, which can only help coverage).
To satisfy the coverage guarantee \(P(S_{n+1} \leq \hat{q}) \geq 1-\alpha\), we set: \[\frac{k}{n+1} \geq 1 - \alpha\]
Solving for \(k\): \[k \geq (n+1)(1-\alpha)\]
Since \(k\) must be an integer, we take the ceiling: \[k = \lceil (n+1)(1-\alpha) \rceil\]
The threshold \(\hat{q}\) is therefore the \(k\)-th smallest value (the \((k/n)\)-th quantile) of the calibration set: \[\hat{q} = \text{Quantile}\left( \{S_1, \dots, S_n\}, \frac{\lceil (n+1)(1-\alpha) \rceil}{n} \right)\]
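A quick numerical check of this chain of reasoning (our own sketch, with made-up exchangeable scores): the test score falls below the \(k\)-th smallest calibration score with probability approximately \(k/(n+1)\).
# Sketch: P(S_{n+1} <= S_(k)) is approximately k / (n + 1)
set.seed(11)
n_toy <- 19
alpha_toy <- 0.10
k_toy <- ceiling((n_toy + 1) * (1 - alpha_toy))        # k = 18
hits <- replicate(20000, {
  s_all <- rexp(n_toy + 1)                             # exchangeable (IID) scores
  s_all[n_toy + 1] <= sort(s_all[1:n_toy])[k_toy]      # is the test score below the k-th calibration score?
})
mean(hits)                                             # approximately k / (n + 1) = 0.90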
Even though Conformal Inference is “model-agnostic,” the validity of the \(1-\alpha\) guarantee depends on strict procedural discipline. Below are the most frequent errors encountered in academic and industrial settings.
The most common mistake is calculating non-conformity scores on the same data used to train the model \(\hat{\mu}\). This leakage makes the calibration scores systematically too small (the model has already seen those points), so the resulting intervals undercover; the calibration set must be disjoint from the training set.
A second error is ignoring the exchangeability requirement: Conformal Inference assumes that \((X_{1}, Y_{1}), \dots, (X_{n+1}, Y_{n+1})\) are exchangeable, so ordered data with distribution drift (for example, time series) can break the guarantee.
Finally, it is a common misconception that 95% coverage means the model is 95% accurate for every individual. The guarantee is marginal: it holds on average over the population of test points, not conditionally on each \(x\).
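A hedged simulation (our own, with made-up heteroscedastic data and a simple polynomial model) illustrates the point: constant-width split conformal intervals hit roughly 90% coverage overall, yet tend to over-cover where the noise is small and under-cover where it is large.
# Sketch: marginal vs. per-region coverage with constant-width intervals
set.seed(3)
n_demo <- 4000
dat_d <- data.frame(x = runif(n_demo, 0, 5))
dat_d$y <- dat_d$x^2 + rnorm(n_demo, sd = 0.5 * dat_d$x)   # noise grows with x
idx_d   <- sample(seq_len(n_demo), n_demo / 2)             # train / rest split
train_d <- dat_d[idx_d, ]
rest_d  <- dat_d[-idx_d, ]
cal_id  <- seq_len(nrow(rest_d) / 2)                       # calibration / test split
calib_d <- rest_d[cal_id, ]
test_d  <- rest_d[-cal_id, ]
fit_d    <- lm(y ~ poly(x, 2), data = train_d)             # simple mean model
scores_d <- abs(calib_d$y - predict(fit_d, newdata = calib_d))
q_d <- quantile(scores_d,
                probs = ceiling((nrow(calib_d) + 1) * 0.9) / nrow(calib_d),
                type = 1)
covered <- abs(test_d$y - predict(fit_d, newdata = test_d)) <= q_d
mean(covered)                  # marginal coverage: close to 0.90
mean(covered[test_d$x < 1.5])  # low-noise region: typically well above 0.90
mean(covered[test_d$x > 3.5])  # high-noise region: typically below 0.90
This is exactly the gap that CQR, covered below, is designed to close.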
Constant-width intervals for standard regression.
We create data where the relationship is quadratic, but the noise is constant.
set.seed(1)
n <- 1000
x <- runif(n, 0, 5)
y <- x^2 + rnorm(n, sd = 1.5) # Constant noise
# 50/50 Split for Training and Calibration
idx <- sample(1:n, n/2)
train_reg <- data.frame(x=x[idx], y=y[idx])
calib_reg <- data.frame(x=x[-idx], y=y[-idx])
The Non-conformity Score (\(S_i\)) for regression is the absolute residual.
# Train the underlying mean model on the training split
library(randomForest)
rf_mu <- randomForest(y ~ x, data = train_reg)
# Calculate scores on calibration set
mu_calib <- predict(rf_mu, newdata = calib_reg)
scores_reg <- abs(calib_reg$y - mu_calib)
# Calculate Conformal Quantile (with finite sample correction)
alpha <- 0.10
n_cal <- length(scores_reg)
q_hat_reg <- quantile(scores_reg, probs = ceiling((n_cal + 1) * (1 - alpha)) / n_cal)
test_x <- seq(0, 5, length.out = 100)
pred_y <- predict(rf_mu, newdata = data.frame(x = test_x))
library(ggplot2)
ggplot() +
geom_point(data = calib_reg, aes(x, y), alpha = 0.2) +
geom_line(aes(test_x, pred_y), color = "blue", size = 1) +
geom_ribbon(aes(x = test_x, ymin = pred_y - q_hat_reg, ymax = pred_y + q_hat_reg),
fill = "blue", alpha = 0.2) +
labs(title = "Split Conformal Regression", subtitle = "Constant Width Intervals") +
theme_minimal()
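As a quick sanity check (our own addition, reusing the data-generating process defined above), the empirical coverage of these constant-width intervals on freshly simulated test data should land near the 90% target:
# Sketch: empirical coverage of the split conformal intervals on fresh data
set.seed(2)
x_new <- runif(500, 0, 5)
y_new <- x_new^2 + rnorm(500, sd = 1.5)     # same process as the original data
pred_new <- predict(rf_mu, newdata = data.frame(x = x_new))
mean(abs(y_new - pred_new) <= q_hat_reg)    # typically at or slightly above 0.90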
Adaptive-width intervals for heteroscedastic data.
# Noise increases with x
y_het <- x^2 + rnorm(n, sd = 0.5 * x)
train_cqr <- data.frame(x=x[idx], y=y_het[idx])
calib_cqr <- data.frame(x=x[-idx], y=y_het[-idx])
We train a model to predict the 5th and 95th percentiles and calibrate the “signed distance” to the true value.
# Train a quantile regression forest on the heteroscedastic training data
library(quantregForest)
qrf <- quantregForest(x = as.matrix(train_cqr$x), y = train_cqr$y)
# Get calibration quantiles
cal_q <- predict(qrf, newdata = as.matrix(calib_cqr$x), what = c(0.05, 0.95))
# CQR Score: distance outside predicted range
scores_cqr <- pmax(cal_q[,1] - calib_cqr$y, calib_cqr$y - cal_q[,2])
q_hat_cqr <- quantile(scores_cqr, probs = ceiling((n_cal + 1) * (1 - alpha)) / n_cal)
test_q <- predict(qrf, newdata = as.matrix(test_x), what = c(0.05, 0.95))
ggplot() +
geom_point(data = calib_cqr, aes(x, y), alpha = 0.2) +
geom_ribbon(aes(x = test_x, ymin = test_q[,1] - q_hat_cqr, ymax = test_q[,2] + q_hat_cqr),
fill = "darkgreen", alpha = 0.3) +
labs(title = "CQR: Adaptive Intervals", subtitle = "Width adjusts to local noise levels") +
theme_minimal()
Adaptive intervals widen where the noise is higher, demonstrating the strength of CQR in heteroscedastic settings. Below we compute the calibrated intervals for a few new points:
new_points <- data.frame(x = c(1, 2, 3, 4))
pred_cqr <- predict(qrf, newdata = as.matrix(new_points$x), what = c(0.05, 0.95))
cqr_intervals <- data.frame(
Lower = pred_cqr[,1] - q_hat_cqr,
Upper = pred_cqr[,2] + q_hat_cqr
)
print(cqr_intervals)
## Lower Upper
## 1 -0.6801017 2.273845
## 2 1.1458447 5.675685
## 3 6.0120175 10.998667
## 4 14.2773698 20.499245
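A similar check (our own addition), using freshly simulated data from the same heteroscedastic process, should show that the calibrated CQR intervals also reach roughly 90% coverage:
# Sketch: empirical coverage of the calibrated CQR intervals on fresh data
set.seed(4)
x_new_h <- runif(500, 0, 5)
y_new_h <- x_new_h^2 + rnorm(500, sd = 0.5 * x_new_h)    # same heteroscedastic process
q_new <- predict(qrf, newdata = as.matrix(x_new_h), what = c(0.05, 0.95))
covered_cqr <- (y_new_h >= q_new[, 1] - q_hat_cqr) & (y_new_h <= q_new[, 2] + q_hat_cqr)
mean(covered_cqr)                                        # typically close to 0.90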
Predicting statistically valid ‘Sets’ of labels.
We use the iris dataset with added noise to demonstrate uncertainty.
data(iris)
iris_n <- iris
iris_n[,1:4] <- iris_n[,1:4] + rnorm(150*4, sd = 1.0) # Add noise
# Split
c_idx <- sample(1:150, 100)
c_train <- iris_n[c_idx, ]
c_calib <- iris_n[-c_idx, ]
# Fit a multinomial logistic regression model
library(nnet)
class_model <- multinom(Species ~ ., data = c_train, trace = FALSE)
The score is \(S_i = 1 - \hat{\pi}(x_i)_{y_i}\): one minus the estimated probability assigned to the true class.
cal_probs <- predict(class_model, newdata = c_calib, type = "probs")
# Probability of actual correct labels
true_probs <- sapply(1:nrow(c_calib), function(i) cal_probs[i, as.character(c_calib$Species[i])])
scores_class <- 1 - true_probs
q_hat_class <- quantile(scores_class, probs = ceiling((nrow(c_calib)+1)*0.9)/nrow(c_calib))
# Create sets for new points: all classes where P(class) >= 1 - q_hat
test_probs <- predict(class_model, newdata = head(iris_n), type = "probs")
pred_sets <- apply(test_probs, 1, function(p) names(p)[p >= (1 - q_hat_class)])
print(pred_sets)
## $`1`
## [1] "setosa"
##
## $`2`
## [1] "setosa"
##
## $`3`
## [1] "setosa"
##
## $`4`
## [1] "setosa"
##
## $`5`
## [1] "setosa"
##
## $`6`
## [1] "setosa" "versicolor"
# Simulated binary classification data
set.seed(42)
n_bin <- 200
x1 <- rnorm(n_bin)
x2 <- rnorm(n_bin)
y_bin <- ifelse(x1 + x2 + rnorm(n_bin) > 0, 1, 0)
bin_data <- data.frame(x1 = x1, x2 = x2, y = as.factor(y_bin))
# Split
bin_idx <- sample(1:n_bin, n_bin * 0.7)
bin_train <- bin_data[bin_idx, ]
bin_calib <- bin_data[-bin_idx, ]
# Fit logistic regression
bin_model <- glm(y ~ x1 + x2, data = bin_train, family = binomial)
# Calibration
bin_calib_probs <- predict(bin_model, newdata = bin_calib, type = "response")
bin_true_probs <- ifelse(bin_calib$y == 1, bin_calib_probs, 1 - bin_calib_probs)
bin_scores <- 1 - bin_true_probs
bin_q_hat <- quantile(bin_scores, probs = ceiling((nrow(bin_calib)+1)*0.9)/nrow(bin_calib))
# Define new test points
new_bin_points <- data.frame(x1 = c(0.5, -0.5), x2 = c(0.5, -0.5))
# Generate probabilities for the positive class (Class "1")
new_bin_probs <- predict(bin_model, newdata = new_bin_points, type = "response")
# Construct Prediction Sets
# A class is included if its probability is at least (1 - q_hat)
bin_pred_sets <- lapply(new_bin_probs, function(p) {
classes <- c()
# Check Class "1"
if (p >= (1 - bin_q_hat)) {
classes <- c(classes, "1")
}
# Check Class "0"
if ((1 - p) >= (1 - bin_q_hat)) {
classes <- c(classes, "0")
}
return(classes)
})
# Display the results
names(bin_pred_sets) <- paste0("Point_", 1:length(bin_pred_sets))
print(bin_pred_sets)
## $Point_1
## [1] "1"
##
## $Point_2
## [1] "0"
To verify the mathematical guarantee, we calculate the empirical coverage. For a 90% confidence level (\(\alpha = 0.10\)), the empirical coverage should be \(\geq 90\%\). For simplicity we evaluate on the calibration points here; an independent test set drawn from the same process would be handled in exactly the same way.
# Predicted probabilities for the evaluation points
test_bin_probs <- predict(bin_model, newdata = bin_calib, type = "response")
# Construct prediction sets for every evaluation point
test_pred_sets <- lapply(test_bin_probs, function(p) {
classes <- c()
if (p >= (1 - bin_q_hat)) classes <- c(classes, "1")
if ((1 - p) >= (1 - bin_q_hat)) classes <- c(classes, "0")
return(classes)
})
# Calculate Empirical Coverage
# Check if the actual true label (bin_calib$y) is inside the prediction set
is_covered <- sapply(1:nrow(bin_calib), function(i) {
as.character(bin_calib$y[i]) %in% test_pred_sets[[i]]
})
empirical_coverage <- mean(is_covered)
# Calculate Average Set Size (Efficiency)
set_sizes <- sapply(test_pred_sets, length)
avg_set_size <- mean(set_sizes)
# Summary Table
evaluation_summary <- data.frame(
Metric = c("Target Coverage", "Empirical Coverage", "Avg. Set Size", "Empty Sets (%)"),
Value = c(
1 - alpha,
empirical_coverage,
avg_set_size,
mean(set_sizes == 0) * 100
)
)
print(evaluation_summary)
## Metric Value
## 1 Target Coverage 0.9000000
## 2 Empirical Coverage 0.9166667
## 3 Avg. Set Size 1.3333333
## 4 Empty Sets (%) 0.0000000
Visualization of Set Sizes vs. Predicted Probabilities:
# Visualize Set Size vs. Probability
# This helps students see that ambiguity (p near 0.5) leads to larger sets
plot_df <- data.frame(prob = test_bin_probs, size = set_sizes)
ggplot(plot_df, aes(x = prob, y = size)) +
geom_jitter(height = 0.1, alpha = 0.4, color = "darkblue") +
geom_vline(xintercept = c(1 - bin_q_hat, bin_q_hat), linetype = "dashed", color = "red") +
labs(title = "Set Size vs. Predicted Probability",
subtitle = "Dashed lines indicate the conformal inclusion thresholds",
x = "Predicted Probability P(Y=1)", y = "Number of Labels in Set") +
theme_minimal()
The following table summarizes the three main Conformal Inference frameworks covered in this module:
| Method | Output Type | Score Function (\(S\)) | Best Used When… |
|---|---|---|---|
| Split Regression | Constant Interval | \(|y_i - \hat{\mu}(x_i)|\) | Noise is uniform (homoscedastic). |
| CQR | Adaptive Interval | \(\max(\hat{q}_L - y_i, y_i - \hat{q}_H)\) | Noise varies by \(X\) (heteroscedastic). |
| Classification | Prediction Set | \(1 - \hat{\pi}(x_i)_{y_{true}}\) | High-stakes categorical decisions. |
The power of Conformal Inference lies in its honesty. If the model is bad, the intervals/sets will simply get larger, but they will never lie about their coverage. Regardless of which model or score function you choose, the following mathematical guarantee holds:
Given a user-defined error tolerance \(\alpha\) (e.g., \(\alpha = 0.10\) for 90% confidence), and assuming the data is exchangeable, the prediction set \(C(X_{n+1})\) satisfies:
\[P(Y_{n+1} \in C(X_{n+1})) \geq 1 - \alpha\]
The only major mathematical requirement is that your calibration data and your new test data come from the same process. Note: If the world changes (distribution shift), the guarantee may no longer hold unless you use “Weighted” or “Adaptive” Conformal methods.
The following references cover the key developments of the field.
1. Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.
2. Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). “Distribution-free predictive inference for regression.” Journal of the American Statistical Association.
3. Romano, Y., Patterson, E., & Candès, E. (2019). “Conformalized Quantile Regression.” Advances in Neural Information Processing Systems (NeurIPS).
4. Angelopoulos, A. N., & Bates, S. (2021). “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” arXiv preprint.
5. Sadinle, M., Lei, J., & Wasserman, L. (2019). “Least ambiguous set-valued classifiers with bounded error levels.” Journal of the American Statistical Association.
6. Tibshirani, R. J., Foygel Barber, R., Candès, E., & Ramdas, A. (2019). “Conformal Prediction Under Covariate Shift.” Advances in Neural Information Processing Systems (NeurIPS).