Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention
Quick Take
The study introduces a part-factorized concept bottleneck model (CBM) built on a frozen DINOv3 vision transformer. On CUB-200-2011 it reaches 88.85% top-1 accuracy, against 88.95% for a fully supervised baseline, while raising pointing accuracy by 16 points. Its spatial prior needs no per-image keypoint supervision at training or test time: keypoints from 0.5% of the training set (about 27 images) are enough to initialize it, and a variant that uses a PCA foreground target removes per-image supervision altogether.
Key Points
- Part-factorized CBM routes each concept head to the part token its name implies, so a head named for one body region cannot be satisfied by evidence on another.
- Achieves 88.85% top-1 accuracy on CUB-200-2011, against 88.95% for the fully supervised baseline.
- Pointing accuracy improved by 16 points, from 36.4% to 52.6%.
- Keypoints from only 0.5% of the training set (about 27 images) suffice to initialize the spatial prior with no measurable loss.
- Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
Article Content
From source RSS / original summary
arXiv:2606.04364v1 (announce type: new)
Abstract: Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction.
The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies.
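The fixed concept-to-part routing can be illustrated with a minimal sketch. It is not the authors' implementation: the part names, the three attribute names, the linear concept heads, and the function `concept_logits` are all assumptions made for illustration. The point it shows is structural: each concept head receives only the token of its assigned part.

```python
# Hypothetical parts and attribute-to-part assignments, for illustration only.
CONCEPT_TO_PART = {
    "has_crown_color::red": "head",
    "has_wing_color::black": "wing",
    "has_tail_shape::forked": "tail",
}


def concept_logits(part_tokens, weights, biases):
    """Score each concept from the single part token its name implies.

    part_tokens: dict mapping part name -> list[float] (one token per part)
    weights:     dict mapping concept name -> list[float] (same length as a token)
    biases:      dict mapping concept name -> float
    """
    logits = {}
    for concept, part in CONCEPT_TO_PART.items():
        token = part_tokens[part]  # the only input this head is allowed to read
        dot = sum(w * x for w, x in zip(weights[concept], token))
        logits[concept] = dot + biases[concept]
    return logits


# Toy two-dimensional part tokens and identical linear heads (assumed values).
part_tokens = {"head": [1.0, 0.0], "wing": [0.0, 1.0], "tail": [0.5, 0.5]}
weights = {concept: [1.0, -1.0] for concept in CONCEPT_TO_PART}
biases = {concept: 0.0 for concept in CONCEPT_TO_PART}
base = concept_logits(part_tokens, weights, biases)

# Changing the wing token moves only the wing concept.
perturbed_tokens = dict(part_tokens, wing=[5.0, -5.0])
perturbed = concept_logits(perturbed_tokens, weights, biases)
```

Because the map is fixed rather than learned, evidence on the wing cannot leak into a head-named concept in this sketch, which is the property the abstract describes as holding by construction.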
A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%).
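A small sketch may help show how an additive log-space Gaussian prior breaks the symmetry among part queries. It rests on assumptions the abstract does not spell out: an axis-aligned Gaussian, patch centres normalized to [0, 1], a 4x4 patch grid, and the helper names `gaussian_log_prior` and `part_attention`. The means and widths here are fixed numbers, whereas in the paper they are learnable.

```python
import math


def gaussian_log_prior(grid_h, grid_w, mean, sigma):
    """Unnormalized log-density of an axis-aligned 2D Gaussian at each patch centre.

    Coordinates are normalized to [0, 1]; mean = (x, y), sigma = (sx, sy).
    Patches are returned in row-major order.
    """
    log_prior = []
    for row in range(grid_h):
        for col in range(grid_w):
            x = (col + 0.5) / grid_w
            y = (row + 0.5) / grid_h
            zx = (x - mean[0]) / sigma[0]
            zy = (y - mean[1]) / sigma[1]
            log_prior.append(-0.5 * (zx * zx + zy * zy))
    return log_prior


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def part_attention(content_logits, log_prior):
    """Add the log-prior to the query-key logits, then normalize over patches."""
    return softmax([c + p for c, p in zip(content_logits, log_prior)])


grid_h = grid_w = 4
# Identical content logits: without a prior, every part query would attend the same way.
content_logits = [0.0] * (grid_h * grid_w)

head_attn = part_attention(
    content_logits, gaussian_log_prior(grid_h, grid_w, (0.2, 0.2), (0.15, 0.15))
)
tail_attn = part_attention(
    content_logits, gaussian_log_prior(grid_h, grid_w, (0.8, 0.8), (0.15, 0.15))
)
head_peak = head_attn.index(max(head_attn))  # top-left patch
tail_peak = tail_attn.index(max(tail_attn))  # bottom-right patch
```

With identical content logits the two queries still produce different attention maps, since each is pulled toward its own prior mean. Adding the term before the softmax is equivalent to multiplying the attention weights by the Gaussian and renormalizing.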
Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
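Initializing the prior means from a small annotated fraction can be sketched as follows. This is a guess at the general procedure, not the paper's code: the function `init_prior_means`, the annotation layout (one dict per image mapping part name to a normalized `(x, y)` or `None` when the part is not visible), and the synthetic keypoints are all assumptions.

```python
import random


def init_prior_means(keypoints, fraction, seed=0):
    """Average each part's keypoint location over a random subset of images.

    keypoints: list with one dict per image, part name -> (x, y) in [0, 1] or None
    fraction:  share of images whose annotations are used (e.g. 0.005 for 0.5%)
    Returns a dict mapping part name -> mean (x, y) over the sampled subset.
    """
    rng = random.Random(seed)
    n_used = max(1, round(fraction * len(keypoints)))
    subset = rng.sample(keypoints, n_used)
    sums = {}
    for annotation in subset:
        for part, xy in annotation.items():
            if xy is None:  # part not visible in this image
                continue
            sx, sy, count = sums.get(part, (0.0, 0.0, 0))
            sums[part] = (sx + xy[0], sy + xy[1], count + 1)
    return {part: (sx / count, sy / count) for part, (sx, sy, count) in sums.items()}


# Synthetic annotations (assumed): each part jitters within 0.05 of a fixed centre.
data_rng = random.Random(1)


def jitter(centre):
    return centre + data_rng.uniform(-0.05, 0.05)


keypoints = [
    {"head": (jitter(0.3), jitter(0.3)), "tail": (jitter(0.7), jitter(0.7))}
    for _ in range(1000)
]
means = init_prior_means(keypoints, fraction=0.005)  # uses 5 of the 1000 images
```

The resulting means are dataset-level statistics, so once they are computed no individual image needs keypoint annotations during training or inference, in line with the abstract's claim.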