Concepts Worth Having: Refining VLM-Guided Concept Bottleneck Models with Minimal Annotations
Quick Take
VH-CBM refines VLM-guided concept-bottleneck models with a small amount of dense human annotations, yielding more accurate and better-calibrated concepts.
Key Points
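- Combines VLM-derived supervision with a small amount of dense expert annotations.
- Uses a Gaussian Process in the VLM's embedding space to propagate the expert's supervision to any target data point.
- Predicts more accurate concepts than VLM-guided CBMs even when as little as 1% of the data is annotated, with better concept calibration and support for active learning.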
Abstract: Concept-bottleneck models (CBMs) are neural classifiers that compute predictions from high-level concepts extracted from the input. CBMs ensure stakeholders can understand the concepts -- and the predictions they entail -- by learning these from concept-level annotations, which, however, are seldom available. Recent CBM architectures work around this issue by obtaining annotations from Vision-Language Models (VLMs). While this greatly broadens applicability, it can yield lower-quality concepts and therefore less interpretable models. We strike a middle ground by introducing Vision-plus-Human-guided CBM (VH-CBM), a hybrid approach that exploits both VLMs and a small amount of dense annotations. VH-CBM employs a Gaussian Process in the VLM's embedding space, which captures useful global information about the target domain, to propagate the expert's supervision to any target data point. Our empirical evaluation shows that VH-CBM predicts more accurate concepts than VLM-guided CBMs even when annotating as little as 1% of the data, while offering better concept calibration and supporting active learning.
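The idea of propagating a handful of expert labels through an embedding space with a Gaussian Process can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain GP regression on ±1 concept labels with an RBF kernel, and it uses synthetic 2-D points as a stand-in for real VLM embeddings. The function names (`rbf_kernel`, `gp_posterior`), the length scale, and the noise level are all hypothetical choices made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)


def gp_posterior(x_lab, y_lab, x_all, length_scale=1.0, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at x_all.

    x_lab, y_lab: the few annotated embeddings and their concept labels.
    x_all: every embedding we want a concept estimate for.
    """
    k_ll = rbf_kernel(x_lab, x_lab, length_scale) + noise * np.eye(len(x_lab))
    k_al = rbf_kernel(x_all, x_lab, length_scale)
    chol = np.linalg.cholesky(k_ll)
    # alpha = k_ll^{-1} y_lab, computed via the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_lab))
    mean = k_al @ alpha
    v = np.linalg.solve(chol, k_al.T)
    # The RBF kernel has prior variance 1 at every point.
    var = np.maximum(1.0 - np.sum(v**2, axis=0), 0.0)
    return mean, var


# ASSUMPTION: two synthetic clusters stand in for VLM embeddings of images
# where a concept is absent (-1) or present (+1).
rng = np.random.default_rng(0)
emb_absent = rng.normal(loc=-3.0, scale=0.5, size=(100, 2))
emb_present = rng.normal(loc=3.0, scale=0.5, size=(100, 2))
embeddings = np.vstack([emb_absent, emb_present])
true_concept = np.concatenate([-np.ones(100), np.ones(100)])

# Annotate 2 of 200 points (1%), one from each cluster.
labelled_idx = np.array([0, 100])
mean, var = gp_posterior(
    embeddings[labelled_idx],
    true_concept[labelled_idx],
    embeddings,
    length_scale=2.0,
)

# Threshold the posterior mean to get a concept prediction everywhere.
predicted = np.where(mean > 0.0, 1.0, -1.0)
accuracy = float(np.mean(predicted == true_concept))

# The posterior variance suggests which point to annotate next.
next_query = int(np.argmax(var))
print(f"concept accuracy from 2 labels: {accuracy:.2f}")
print(f"most uncertain point to annotate next: {next_query}")
```

In this toy setting the posterior mean carries the two labels to their neighbours in embedding space, and the posterior variance gives a simple uncertainty signal of the kind an active-learning loop could use to choose the next annotation. A real pipeline would need a proper classification likelihood and actual VLM embeddings, neither of which this sketch attempts.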
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2605.16405 [cs.CV] (or arXiv:2605.16405v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.16405 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Nicola Debole
[v1] Wed, 13 May 2026 10:07:11 UTC (2,363 KB)
— Originally published at arxiv.org