The strength of Conformal Inference (CI) lies in its Finite-Sample Validity. Unlike asymptotic guarantees that require \(n \to \infty\), CI guarantees hold for any sample size \(n\).
The fundamental assumption underlying Conformal Inference is that the data sequence \(Z_1, \dots, Z_{n+1}\) (where \(Z_i = (X_i, Y_i)\)) is exchangeable.
Definition: A sequence is exchangeable if its joint probability distribution is invariant under permutations. This is a weaker (more general) assumption than being Independent and Identically Distributed (IID): every IID sequence is exchangeable, but the converse does not hold.
We define a score function \(V(z, \mathcal{D})\) that measures how “unusual” a data point \(z\) is relative to a dataset \(\mathcal{D}\). Common score functions include:
For Regression: \[S_i = |y_i - \hat{\mu}(x_i)|\]
For Classification: \[S_i = 1 - \hat{\pi}(x_i)_{y_i}\]
For Conformalized Quantile Regression (CQR): \[S_i = \max(\hat{q}_{low}(x_i) - y_i, \ y_i - \hat{q}_{high}(x_i))\]
In the Split Conformal framework, we calculate scores on a calibration set of size \(n\). To construct a \(1-\alpha\) confidence interval, we define the threshold \(\hat{q}\) as:
\[\hat{q} = \text{The } \frac{\lceil (n+1)(1-\alpha) \rceil}{n} \text{ quantile of } \{S_1, \dots, S_n\}\]
The \(+1\) in the numerator is the critical conformal correction. It mathematically accounts for the fact that the unknown future point \(Z_{n+1}\) is itself a member of the exchangeable sequence.
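As a minimal sketch of this definition in R (the helper name conformal_quantile is our own, not from any package), the threshold can be computed directly from a vector of calibration scores:
# Illustrative helper: conformal threshold from calibration scores
conformal_quantile <- function(scores, alpha) {
  n <- length(scores)
  k <- ceiling((n + 1) * (1 - alpha))
  stopifnot(k <= n)  # needs enough calibration points: (n + 1) * (1 - alpha) <= n
  sort(scores)[k]    # the k-th smallest score; equivalent to quantile(scores, probs = k / n, type = 1)
}
# Example with made-up scores
set.seed(123)
conformal_quantile(abs(rnorm(100)), alpha = 0.10)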
The prediction set for a new point \(x_{n+1}\) is constructed by collecting all possible values of \(y\) that result in a score no greater than the threshold \(\hat{q}\):
\[C(x_{n+1}) = \{ y : V((x_{n+1}, y), \mathcal{D}_{train}) \leq \hat{q} \}\]
The Guarantee: Under the sole assumption of exchangeability, it is mathematically proven that for any distribution and any model:
\[P(Y_{n+1} \in C(x_{n+1})) \geq 1 - \alpha\]
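A Monte Carlo sketch (our own illustration, using a simple polynomial regression on simulated data rather than any particular model) makes the guarantee concrete: repeating the train/calibrate/predict cycle many times, the new point falls inside the interval at least \(1-\alpha\) of the time on average.
# Sketch: Monte Carlo check of the marginal coverage guarantee (illustrative setup)
set.seed(100)
one_run <- function(n_tr = 200, n_cal = 200, alpha = 0.10) {
  x_all <- runif(n_tr + n_cal + 1, 0, 5)
  y_all <- x_all^2 + rnorm(n_tr + n_cal + 1, sd = 1.5)
  tr  <- 1:n_tr
  cal <- (n_tr + 1):(n_tr + n_cal)
  te  <- n_tr + n_cal + 1                               # one new test point
  fit <- lm(y ~ poly(x, 2), data = data.frame(x = x_all[tr], y = y_all[tr]))
  s   <- abs(y_all[cal] - predict(fit, data.frame(x = x_all[cal])))
  q   <- quantile(s, probs = ceiling((n_cal + 1) * (1 - alpha)) / n_cal, type = 1)
  abs(y_all[te] - predict(fit, data.frame(x = x_all[te]))) <= q
}
mean(replicate(2000, one_run()))  # approximately 0.90 or slightly above, up to Monte Carlo error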
The validity of Conformal Inference relies on a simple yet profound observation about the ranks of exchangeable random variables.
Assume we have \(n\) calibration scores \(\{S_1, \dots, S_n\}\) and one future test score \(S_{n+1}\). If the underlying data points are exchangeable, then their corresponding non-conformity scores are also exchangeable.
By symmetry, the future score \(S_{n+1}\) is equally likely to fall into any of the “bins” created by the sorted calibration scores. With \(n\) calibration points, there are \(n+1\) possible slots where \(S_{n+1}\) could land.
Mathematically, for any \(j \in \{1, \dots, n+1\}\): \[P\left( \text{Rank}(S_{n+1}) = j \right) = \frac{1}{n+1}\]
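A short simulation (our own; IID normal scores, which are in particular exchangeable) confirms this uniform rank distribution:
# Sketch: with exchangeable scores, Rank(S_{n+1}) is uniform on 1, ..., n + 1
set.seed(7)
n_cal_demo <- 9
ranks <- replicate(20000, {
  s <- rnorm(n_cal_demo + 1)          # n + 1 exchangeable (here IID) scores
  rank(s)[n_cal_demo + 1]             # rank of the held-out "test" score
})
round(prop.table(table(ranks)), 3)    # each rank should appear with probability about 1 / (n + 1) = 0.1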
We want to ensure that \(S_{n+1}\) is not “too large.” Specifically, we want to find a threshold \(\hat{q}\) such that \(S_{n+1} \leq \hat{q}\) with probability at least \(1-\alpha\).
If we sort our calibration scores as \(S_{(1)} \leq S_{(2)} \leq \dots \leq S_{(n)}\), then \(S_{n+1}\) is less than or equal to the \(k\)-th smallest calibration score \(S_{(k)}\) if its rank is among the first \(k\) positions:
\[P(S_{n+1} \leq S_{(k)}) \geq \frac{k}{n+1}\] (with equality when the scores are almost surely distinct; ties only increase this probability, which can only help coverage).
To satisfy the coverage guarantee \(P(S_{n+1} \leq \hat{q}) \geq 1-\alpha\), we set: \[\frac{k}{n+1} \geq 1 - \alpha\]
Solving for \(k\): \[k \geq (n+1)(1-\alpha)\]
Since \(k\) must be an integer, we take the ceiling: \[k = \lceil (n+1)(1-\alpha) \rceil\]
The threshold \(\hat{q}\) is therefore the \(k\)-th smallest value (the \((k/n)\)-th quantile) of the calibration set: \[\hat{q} = \text{Quantile}\left( \{S_1, \dots, S_n\}, \frac{\lceil (n+1)(1-\alpha) \rceil}{n} \right)\]
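A quick numerical check of this chain of reasoning (our own sketch, with made-up exchangeable scores): the test score falls below the \(k\)-th smallest calibration score with probability approximately \(k/(n+1)\).
# Sketch: P(S_{n+1} <= S_(k)) is approximately k / (n + 1)
set.seed(11)
n_toy <- 19
alpha_toy <- 0.10
k_toy <- ceiling((n_toy + 1) * (1 - alpha_toy))        # k = 18
hits <- replicate(20000, {
  s_all <- rexp(n_toy + 1)                             # exchangeable (IID) scores
  s_all[n_toy + 1] <= sort(s_all[1:n_toy])[k_toy]      # is the test score below the k-th calibration score?
})
mean(hits)                                             # approximately k / (n + 1) = 0.90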
Even though Conformal Inference is “model-agnostic,” the validity of the \(1-\alpha\) guarantee depends on strict procedural discipline. Below are the most frequent errors encountered in academic and industrial settings.
The most common mistake is calculating non-conformity scores on the same data used to train the model \(\hat{\mu}\). This leakage makes the calibration scores systematically too small (the model has already seen those points), so the resulting intervals undercover; the calibration set must be disjoint from the training set.
A second error is ignoring the exchangeability requirement: Conformal Inference assumes that \((X_{1}, Y_{1}), \dots, (X_{n+1}, Y_{n+1})\) are exchangeable, so ordered data with distribution drift (for example, time series) can break the guarantee.
Finally, it is a common misconception that 95% coverage means the model is 95% accurate for every individual. The guarantee is marginal: it holds on average over the population of test points, not conditionally on each \(x\).
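A hedged simulation (our own, with made-up heteroscedastic data and a simple polynomial model) illustrates the point: constant-width split conformal intervals hit roughly 90% coverage overall, yet tend to over-cover where the noise is small and under-cover where it is large.
# Sketch: marginal vs. per-region coverage with constant-width intervals
set.seed(3)
n_demo <- 4000
dat_d <- data.frame(x = runif(n_demo, 0, 5))
dat_d$y <- dat_d$x^2 + rnorm(n_demo, sd = 0.5 * dat_d$x)   # noise grows with x
idx_d   <- sample(seq_len(n_demo), n_demo / 2)             # train / rest split
train_d <- dat_d[idx_d, ]
rest_d  <- dat_d[-idx_d, ]
cal_id  <- seq_len(nrow(rest_d) / 2)                       # calibration / test split
calib_d <- rest_d[cal_id, ]
test_d  <- rest_d[-cal_id, ]
fit_d    <- lm(y ~ poly(x, 2), data = train_d)             # simple mean model
scores_d <- abs(calib_d$y - predict(fit_d, newdata = calib_d))
q_d <- quantile(scores_d,
                probs = ceiling((nrow(calib_d) + 1) * 0.9) / nrow(calib_d),
                type = 1)
covered <- abs(test_d$y - predict(fit_d, newdata = test_d)) <= q_d
mean(covered)                  # marginal coverage: close to 0.90
mean(covered[test_d$x < 1.5])  # low-noise region: typically well above 0.90
mean(covered[test_d$x > 3.5])  # high-noise region: typically below 0.90
This is exactly the gap that CQR, covered below, is designed to close.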
Constant-width intervals for standard regression.
We create data where the relationship is quadratic, but the noise is constant.
set.seed(1)
n <- 1000
x <- runif(n, 0, 5)
y <- x^2 + rnorm(n, sd = 1.5) # Constant noise
# 50/50 Split for Training and Calibration
idx <- sample(1:n, n/2)
train_reg <- data.frame(x=x[idx], y=y[idx])
calib_reg <- data.frame(x=x[-idx], y=y[-idx])
The Non-conformity Score (\(S_i\)) for regression is the absolute residual.
# Train the underlying mean model on the training split
library(randomForest)
rf_mu <- randomForest(y ~ x, data = train_reg)
# Calculate scores on calibration set
mu_calib <- predict(rf_mu, newdata = calib_reg)
scores_reg <- abs(calib_reg$y - mu_calib)
# Calculate Conformal Quantile (with finite sample correction)
alpha <- 0.10
n_cal <- length(scores_reg)
q_hat_reg <- quantile(scores_reg, probs = ceiling((n_cal + 1) * (1 - alpha)) / n_cal)
test_x <- seq(0, 5, length.out = 100)
pred_y <- predict(rf_mu, newdata = data.frame(x = test_x))
library(ggplot2)
ggplot() +
geom_point(data = calib_reg, aes(x, y), alpha = 0.2) +
geom_line(aes(test_x, pred_y), color = "blue", size = 1) +
geom_ribbon(aes(x = test_x, ymin = pred_y - q_hat_reg, ymax = pred_y + q_hat_reg),
fill = "blue", alpha = 0.2) +
labs(title = "Split Conformal Regression", subtitle = "Constant Width Intervals") +
theme_minimal()
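As a quick sanity check (our own addition, reusing the data-generating process defined above), the empirical coverage of these constant-width intervals on freshly simulated test data should land near the 90% target:
# Sketch: empirical coverage of the split conformal intervals on fresh data
set.seed(2)
x_new <- runif(500, 0, 5)
y_new <- x_new^2 + rnorm(500, sd = 1.5)     # same process as the original data
pred_new <- predict(rf_mu, newdata = data.frame(x = x_new))
mean(abs(y_new - pred_new) <= q_hat_reg)    # typically at or slightly above 0.90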
Adaptive-width intervals for heteroscedastic data.
# Noise increases with x
y_het <- x^2 + rnorm(n, sd = 0.5 * x)
train_cqr <- data.frame(x=x[idx], y=y_het[idx])
calib_cqr <- data.frame(x=x[-idx], y=y_het[-idx])
We train a model to predict the 5th and 95th percentiles and calibrate the “signed distance” to the true value.
# Train a quantile regression forest on the heteroscedastic training data
library(quantregForest)
qrf <- quantregForest(x = as.matrix(train_cqr$x), y = train_cqr$y)
# Get calibration quantiles
cal_q <- predict(qrf, newdata = as.matrix(calib_cqr$x), what = c(0.05, 0.95))
# CQR Score: distance outside predicted range
scores_cqr <- pmax(cal_q[,1] - calib_cqr$y, calib_cqr$y - cal_q[,2])
q_hat_cqr <- quantile(scores_cqr, probs = ceiling((n_cal + 1) * (1 - alpha)) / n_cal)
test_q <- predict(qrf, newdata = as.matrix(test_x), what = c(0.05, 0.95))
ggplot() +
geom_point(data = calib_cqr, aes(x, y), alpha = 0.2) +
geom_ribbon(aes(x = test_x, ymin = test_q[,1] - q_hat_cqr, ymax = test_q[,2] + q_hat_cqr),
fill = "darkgreen", alpha = 0.3) +
labs(title = "CQR: Adaptive Intervals", subtitle = "Width adjusts to local noise levels") +
theme_minimal()
Adaptive intervals widen where the noise is higher, demonstrating the strength of CQR in heteroscedastic settings. Below we compute the calibrated intervals for a few new points:
new_points <- data.frame(x = c(1, 2, 3, 4))
pred_cqr <- predict(qrf, newdata = as.matrix(new_points$x), what = c(0.05, 0.95))
cqr_intervals <- data.frame(
Lower = pred_cqr[,1] - q_hat_cqr,
Upper = pred_cqr[,2] + q_hat_cqr
)
print(cqr_intervals)
## Lower Upper
## 1 -0.6801017 2.273845
## 2 1.1458447 5.675685
## 3 6.0120175 10.998667
## 4 14.2773698 20.499245
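A similar check (our own addition), using freshly simulated data from the same heteroscedastic process, should show that the calibrated CQR intervals also reach roughly 90% coverage:
# Sketch: empirical coverage of the calibrated CQR intervals on fresh data
set.seed(4)
x_new_h <- runif(500, 0, 5)
y_new_h <- x_new_h^2 + rnorm(500, sd = 0.5 * x_new_h)    # same heteroscedastic process
q_new <- predict(qrf, newdata = as.matrix(x_new_h), what = c(0.05, 0.95))
covered_cqr <- (y_new_h >= q_new[, 1] - q_hat_cqr) & (y_new_h <= q_new[, 2] + q_hat_cqr)
mean(covered_cqr)                                        # typically close to 0.90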
Predicting statistically valid ‘Sets’ of labels.
We use the iris dataset with added noise to demonstrate uncertainty.
data(iris)
iris_n <- iris
iris_n[,1:4] <- iris_n[,1:4] + rnorm(150*4, sd = 1.0) # Add noise
# Split
c_idx <- sample(1:150, 100)
c_train <- iris_n[c_idx, ]
c_calib <- iris_n[-c_idx, ]
# Fit a multinomial logistic regression model
library(nnet)
class_model <- multinom(Species ~ ., data = c_train, trace = FALSE)
The score is \(S_i = 1 - \hat{\pi}(x_i)_{y_i}\): one minus the estimated probability assigned to the true class.
cal_probs <- predict(class_model, newdata = c_calib, type = "probs")
# Probability of actual correct labels
true_probs <- sapply(1:nrow(c_calib), function(i) cal_probs[i, as.character(c_calib$Species[i])])
scores_class <- 1 - true_probs
q_hat_class <- quantile(scores_class, probs = ceiling((nrow(c_calib)+1)*0.9)/nrow(c_calib))
# Create sets for new points: all classes where P(class) >= 1 - q_hat
test_probs <- predict(class_model, newdata = head(iris_n), type = "probs")
pred_sets <- apply(test_probs, 1, function(p) names(p)[p >= (1 - q_hat_class)])
print(pred_sets)
## $`1`
## [1] "setosa"
##
## $`2`
## [1] "setosa"
##
## $`3`
## [1] "setosa"
##
## $`4`
## [1] "setosa"
##
## $`5`
## [1] "setosa"
##
## $`6`
## [1] "setosa" "versicolor"
# Simulated binary classification data
set.seed(42)
n_bin <- 200
x1 <- rnorm(n_bin)
x2 <- rnorm(n_bin)
y_bin <- ifelse(x1 + x2 + rnorm(n_bin) > 0, 1, 0)
bin_data <- data.frame(x1 = x1, x2 = x2, y = as.factor(y_bin))
# Split
bin_idx <- sample(1:n_bin, n_bin * 0.7)
bin_train <- bin_data[bin_idx, ]
bin_calib <- bin_data[-bin_idx, ]
# Fit logistic regression
bin_model <- glm(y ~ x1 + x2, data = bin_train, family = binomial)
# Calibration
bin_calib_probs <- predict(bin_model, newdata = bin_calib, type = "response")
bin_true_probs <- ifelse(bin_calib$y == 1, bin_calib_probs, 1 - bin_calib_probs)
bin_scores <- 1 - bin_true_probs
bin_q_hat <- quantile(bin_scores, probs = ceiling((nrow(bin_calib)+1)*0.9)/nrow(bin_calib))
# Define new test points
new_bin_points <- data.frame(x1 = c(0.5, -0.5), x2 = c(0.5, -0.5))
# Generate probabilities for the positive class (Class "1")
new_bin_probs <- predict(bin_model, newdata = new_bin_points, type = "response")
# Construct Prediction Sets
# A class is included if its probability is at least (1 - q_hat)
bin_pred_sets <- lapply(new_bin_probs, function(p) {
classes <- c()
# Check Class "1"
if (p >= (1 - bin_q_hat)) {
classes <- c(classes, "1")
}
# Check Class "0"
if ((1 - p) >= (1 - bin_q_hat)) {
classes <- c(classes, "0")
}
return(classes)
})
# Display the results
names(bin_pred_sets) <- paste0("Point_", 1:length(bin_pred_sets))
print(bin_pred_sets)
## $Point_1
## [1] "1"
##
## $Point_2
## [1] "0"
To verify the mathematical guarantee, we calculate the empirical coverage. For a 90% confidence level (\(\alpha = 0.10\)), the empirical coverage should be \(\geq 90\%\). For simplicity we evaluate on the calibration points here; an independent test set drawn from the same process would be handled in exactly the same way.
# Predicted probabilities for the evaluation points
test_bin_probs <- predict(bin_model, newdata = bin_calib, type = "response")
# Construct prediction sets for every evaluation point
test_pred_sets <- lapply(test_bin_probs, function(p) {
classes <- c()
if (p >= (1 - bin_q_hat)) classes <- c(classes, "1")
if ((1 - p) >= (1 - bin_q_hat)) classes <- c(classes, "0")
return(classes)
})
# Calculate Empirical Coverage
# Check if the actual true label (bin_calib$y) is inside the prediction set
is_covered <- sapply(1:nrow(bin_calib), function(i) {
as.character(bin_calib$y[i]) %in% test_pred_sets[[i]]
})
empirical_coverage <- mean(is_covered)
# Calculate Average Set Size (Efficiency)
set_sizes <- sapply(test_pred_sets, length)
avg_set_size <- mean(set_sizes)
# Summary Table
evaluation_summary <- data.frame(
Metric = c("Target Coverage", "Empirical Coverage", "Avg. Set Size", "Empty Sets (%)"),
Value = c(
1 - alpha,
empirical_coverage,
avg_set_size,
mean(set_sizes == 0) * 100
)
)
print(evaluation_summary)
## Metric Value
## 1 Target Coverage 0.9000000
## 2 Empirical Coverage 0.9166667
## 3 Avg. Set Size 1.3333333
## 4 Empty Sets (%) 0.0000000
Visualization of Set Sizes vs. Predicted Probabilities:
# Visualize Set Size vs. Probability
# This helps students see that ambiguity (p near 0.5) leads to larger sets
plot_df <- data.frame(prob = test_bin_probs, size = set_sizes)
ggplot(plot_df, aes(x = prob, y = size)) +
geom_jitter(height = 0.1, alpha = 0.4, color = "darkblue") +
geom_vline(xintercept = c(1 - bin_q_hat, bin_q_hat), linetype = "dashed", color = "red") +
labs(title = "Set Size vs. Predicted Probability",
subtitle = "Dashed lines indicate the conformal inclusion thresholds",
x = "Predicted Probability P(Y=1)", y = "Number of Labels in Set") +
theme_minimal()
The following table summarizes the three main Conformal Inference frameworks covered in this module:
| Method | Output Type | Score Function (\(S\)) | Best Used When… |
|---|---|---|---|
| Split Regression | Constant Interval | \(|y_i - \hat{\mu}(x_i)|\) | Noise is uniform (homoscedastic). |
| CQR | Adaptive Interval | \(\max(\hat{q}_L - y_i, y_i - \hat{q}_H)\) | Noise varies by \(X\) (heteroscedastic). |
| Classification | Prediction Set | \(1 - \hat{\pi}(x_i)_{y_{true}}\) | High-stakes categorical decisions. |
The power of Conformal Inference lies in its honesty. If the model is bad, the intervals/sets will simply get larger, but they will never lie about their coverage. Regardless of which model or score function you choose, the following mathematical guarantee holds:
Given a user-defined error tolerance \(\alpha\) (e.g., \(\alpha = 0.10\) for 90% confidence), and assuming the data is exchangeable, the prediction set \(C(X_{n+1})\) satisfies:
\[P(Y_{n+1} \in C(X_{n+1})) \geq 1 - \alpha\]
The only major mathematical requirement is that your calibration data and your new test data come from the same process. Note: If the world changes (distribution shift), the guarantee may no longer hold unless you use “Weighted” or “Adaptive” Conformal methods.
The following references cover the key developments of the field.
1. Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.
2. Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). “Distribution-free predictive inference for regression.” Journal of the American Statistical Association.
3. Romano, Y., Patterson, E., & Candès, E. (2019). “Conformalized Quantile Regression.” Advances in Neural Information Processing Systems (NeurIPS).
4. Angelopoulos, A. N., & Bates, S. (2021). “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” arXiv preprint.
5. Sadinle, M., Lei, J., & Wasserman, L. (2019). “Least ambiguous set-valued classifiers with bounded error levels.” Journal of the American Statistical Association.
6. Tibshirani, R. J., Foygel Barber, R., Candès, E., & Ramdas, A. (2019). “Conformal Prediction Under Covariate Shift.” Advances in Neural Information Processing Systems (NeurIPS).