Post-hoc Probabilistic Vision-Language Models

ICLR 2026

1Technical University of Munich, 2Aalto University, 3Finnish Center for Artificial Intelligence, 4University of Tübingen, 5Helmholtz Munich, 6Munich Center for Machine Learning (MCML), 7Munich Data Science Institute (MDSI), 8KTH Royal Institute of Technology

TL;DR:

We present a training-free, post-hoc Bayesian method for uncertainty estimation in vision–language models. It yields interpretable and well-calibrated uncertainties, enables analytic uncertainty propagation, and improves active learning across VLMs.

Abstract

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. To do so, VLMs deterministically map images and text descriptions into a joint latent space, where their similarity is assessed using cosine similarity. However, such a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.

Contributions

  • We propose BayesVLM, a post-hoc method for uncertainty quantification in pre-trained VLMs without architecture changes or further training.
  • We present a Bayesian formulation for VLMs and derive an analytic approximation of the distribution over cosine similarities (ProbCosine) for efficient uncertainty propagation.
  • We demonstrate BayesVLM’s utility in zero-shot settings and active learning, and assess efficiency and robustness (including proxy-data Hessian estimation).

Methods

BayesVLM is a post-hoc probabilistic VLM: it combines a Bayesian posterior approximation over the final projection layers with uncertainty propagation from probabilistic embeddings to downstream predictions. ProbCosine provides the analytic step that propagates Gaussian embedding uncertainty to a distribution over cosine similarities.

BayesVLM: Laplace Posterior over Projection Layers

We approximate the Bayesian posterior over the final projection matrices of the image and text encoders with a Laplace approximation, combining a tractable likelihood approximation with an efficient Kronecker-factored generalized Gauss-Newton (GGN) curvature estimate.
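As a concrete illustration, the sketch below shows how a Kronecker-factored Gaussian posterior over a single projection matrix induces a Gaussian over the projected embedding. W_map, A_inv, and B_inv are hypothetical placeholders (the MAP weights and inverse curvature factors one would obtain from a GGN fit, which is omitted here); this is a sketch of the general construction, not the official BayesVLM implementation.

import torch

def projected_embedding_dist(x, W_map, A_inv, B_inv):
    # Assumption: vec(W) ~ N(vec(W_map), kron(A_inv, B_inv)) under column-stacking.
    # Then the projection z = W x is Gaussian with
    #   mean = W_map x   and   cov = (x^T A_inv x) * B_inv.
    mean = W_map @ x                 # (d_out,)
    scale = x @ A_inv @ x            # scalar input-side uncertainty
    cov = scale * B_inv              # (d_out, d_out) output-side covariance
    return mean, cov

# Toy usage with random placeholders for the MAP weights and curvature factors.
d_in, d_out = 8, 4
x = torch.randn(d_in)
W_map = torch.randn(d_out, d_in)
A_inv = 0.01 * torch.eye(d_in)       # stand-in for the fitted input-side factor
B_inv = 0.10 * torch.eye(d_out)      # stand-in for the fitted output-side factor
mean, cov = projected_embedding_dist(x, W_map, A_inv, B_inv)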

Illustration of uncertainty propagation in VLMs: We estimate uncertainties over the last linear layers of both encoders using a Laplace approximation, which induces distributions over the feature projections. We then approximate the distribution over cosine similarities by estimating its expected value and variance, and propagate this distribution further to the output.

ProbCosine: Analytic Cosine Similarity Uncertainty

ProbCosine analytically propagates uncertainty from (approximately) Gaussian-distributed image and text embeddings to a probabilistic cosine similarity. We moment-match the resulting cosine similarity distribution (mean and variance), enabling efficient uncertainty-aware inference without Monte Carlo sampling.
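To make the idea concrete, here is a minimal first-order (delta-method) moment match for the cosine similarity of two independent Gaussian embeddings with diagonal covariances, together with a Monte Carlo sanity check. It is a sketch of the general technique, not necessarily the exact expansion used by ProbCosine.

import numpy as np

def cosine_moments(mu_a, var_a, mu_b, var_b):
    # First-order Taylor expansion of cos(a, b) = a.b / (||a|| ||b||) around the
    # means, for a ~ N(mu_a, diag(var_a)) and b ~ N(mu_b, diag(var_b)).
    na, nb = np.linalg.norm(mu_a), np.linalg.norm(mu_b)
    cos = mu_a @ mu_b / (na * nb)
    grad_a = mu_b / (na * nb) - cos * mu_a / na**2   # d cos / d a at the means
    grad_b = mu_a / (na * nb) - cos * mu_b / nb**2   # d cos / d b at the means
    var = grad_a**2 @ var_a + grad_b**2 @ var_b
    return cos, var

# Monte Carlo sanity check against the analytic approximation.
rng = np.random.default_rng(0)
d = 16
mu_a, mu_b = rng.normal(size=d), rng.normal(size=d)
var_a, var_b = np.full(d, 1e-2), np.full(d, 2e-2)
a = mu_a + rng.normal(size=(100_000, d)) * np.sqrt(var_a)
b = mu_b + rng.normal(size=(100_000, d)) * np.sqrt(var_b)
mc = np.sum(a * b, -1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
print(cosine_moments(mu_a, var_a, mu_b, var_b))   # analytic (mean, variance)
print(mc.mean(), mc.var())                        # Monte Carlo estimate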

ProbCosine approximation. Analytic moment-matching closely tracks the cosine similarity distribution across uncertainty levels, providing an efficient alternative to sampling.

Experiments

We evaluate BayesVLM guided by three questions:

  1. Uncertainty quantification: Does BayesVLM provide reliable uncertainty estimates in zero-shot settings?
  2. Active learning: Can BayesVLM uncertainties select informative fine-tuning data?
  3. Efficiency & robustness: What overhead is introduced, and how robust is BayesVLM (incl. proxy data)?

4.1 Uncertainty Quantification

We show that BayesVLM yields well-calibrated uncertainty estimates in the zero-shot setting, reducing overconfident errors under domain shift. Compared to deterministic VLM baselines, BayesVLM improves uncertainty quality while maintaining competitive predictive performance.

Zero-shot uncertainty quantification (Table 1). BayesVLM achieves comparable accuracy to deterministic CLIP while substantially improving calibration (ECE ↓) and negative log predictive density (NLPD ↓).
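For reference, the sketch below computes both metrics in their standard form (a 15-bin expected calibration error and the mean negative log probability of the true class); binning and other details may differ from the paper's exact evaluation protocol.

import numpy as np

def ece(probs, labels, n_bins=15):
    # Expected calibration error: probs is (N, C), labels is (N,).
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

def nlpd(probs, labels, eps=1e-12):
    # Mean negative log predictive probability assigned to the true class.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))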

Zero-shot calibration (Fig. 2): BayesVLM reduces overconfident predictions and improves calibration.

Uncertainty vs. error before/after active learning. Before active learning (zero-shot), many high-error predictions still receive moderate uncertainty, reflecting overconfidence under shift. After selecting a support set via uncertainty-aware acquisition, errors are reduced and remaining mistakes concentrate in regions of higher uncertainty, showing that BayesVLM’s uncertainty is better aligned with errors and supports more effective sample selection.

4.2 Active Learning

We use BayesVLM’s uncertainty estimates to select informative samples for adaptation with Bayesian acquisition functions. This yields more sample-efficient selection compared to entropy-based and random baselines.
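As a concrete example of one such acquisition function, the sketch below scores candidates with BALD, the mutual information between the label and the model parameters, estimated from posterior predictive samples. The probs array of shape (S, N, C) is a hypothetical input (e.g. softmaxed logits from S posterior samples); EPIG, which additionally conditions on a target pool, is omitted for brevity.

import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def bald_scores(probs):
    # BALD(x) = H[ E_theta p(y|x, theta) ] - E_theta H[ p(y|x, theta) ],
    # estimated from probs of shape (S, N, C): S posterior samples,
    # N candidate inputs, C classes.
    mean_pred = probs.mean(axis=0)              # (N, C) posterior-mean predictive
    total_unc = entropy(mean_pred)              # entropy of the mean prediction
    expected_unc = entropy(probs).mean(axis=0)  # mean entropy over samples
    return total_unc - expected_unc             # per-candidate mutual information

def select_topk(probs, k):
    # Pick the k candidates with the highest BALD score.
    return np.argsort(-bald_scores(probs))[:k]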

Active learning (Fig. 4): Using BayesVLM uncertainties with EPIG and BALD acquisition improves adaptation across support set sizes, compared to entropy-based and random selection (both targeted and untargeted).

4.3 Efficiency & Robustness

We analyze the computational overhead of BayesVLM and study robustness to design choices such as curvature and proxy-data Hessian estimation. We also illustrate how ProbCosine captures increasing uncertainty under input corruption.

ProbCosine under corruption (Fig. 5): mean similarity decreases and variance increases with input corruption.

BibTeX

@inproceedings{baumann2026bayesvlm,
  title     = {Post-hoc Probabilistic Vision-Language Models},
  author    = {Baumann, Anton and Li, Rui and Klasson, Marcus and Mentu, Santeri and Karthik, Shyamgopal and Akata, Zeynep and Solin, Arno and Trapp, Martin},
  booktitle = {International Conference on Learning Representations {(ICLR)}},
  year      = {2026},
}