Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

arXiv cs.CV·Dhanesh Ramachandram

6/4/2026

Quick Take

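The paper studies a part-factorized Concept Bottleneck Model (CBM) built on a frozen DINOv3 vision transformer. On CUB-200-2011, the spatial-prior variant reaches 88.85% top-1 accuracy, against 88.95% for a fully supervised baseline, while raising pointing accuracy by 16 points (52.6% versus 36.4%). A second variant replaces bounding-box supervision with a PCA foreground target, which removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. Keypoints from only 0.5% of the training set (about 27 images) suffice to initialize the prior.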

Key Points

Each concept head reads only from the part token its name implies, so it cannot be satisfied by evidence on another body region.
The spatial-prior model reaches 88.85% top-1 accuracy on CUB-200-2011, against 88.95% for the fully supervised baseline.
Pointing accuracy rises by 16 points, from 36.4% to 52.6%.
Keypoints from only 0.5% of the training set (about 27 images) suffice to initialize the spatial prior.
Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.

Article Content

From source RSS / original summary

arXiv:2606.04364v1. Abstract: Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction.

The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies.
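The gating and routing described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `part_tokens` and `concept_logits`, the toy sizes, the choice to apply the foreground gate as an additive log term, and the per-concept linear heads are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def part_tokens(patches, queries, fg_gate, eps=1e-6):
    """Cross-attend K part queries to N patch features.

    patches: (N, D), queries: (K, D), fg_gate: (N,) with values in [0, 1].
    Assumption: the gate enters the attention logits as log(gate), so
    patches with gate near 0 receive almost no attention mass.
    """
    d = patches.shape[1]
    logits = queries @ patches.T / np.sqrt(d)   # (K, N) scaled dot products
    logits = logits + np.log(fg_gate + eps)     # suppress background patches
    attn = softmax(logits, axis=-1)             # each part's map sums to 1
    return attn @ patches, attn                 # (K, D) tokens, (K, N) maps

def concept_logits(tokens, concept_to_part, W, b):
    """Each concept reads only the part token assigned by the fixed map.

    tokens: (K, D), concept_to_part: (C,) part indices, W: (C, D), b: (C,).
    """
    routed = tokens[concept_to_part]            # (C, D), one token per concept
    return np.einsum("cd,cd->c", routed, W) + b

# Toy example: 16 patches, 3 parts, 5 concepts (the paper uses 312 attributes).
rng = np.random.default_rng(0)
N, D, K, C = 16, 8, 3, 5
patches = rng.normal(size=(N, D))
queries = rng.normal(size=(K, D))
fg_gate = np.zeros(N)
fg_gate[:8] = 1.0                               # first 8 patches are foreground
concept_to_part = np.array([0, 0, 1, 2, 2])     # hypothetical concept-to-part map
W = rng.normal(size=(C, D))
b = np.zeros(C)

tokens, attn = part_tokens(patches, queries, fg_gate)
logits = concept_logits(tokens, concept_to_part, W, b)
```

Because the routing is a fixed index lookup, perturbing one part token can only change the concepts mapped to that part, which is the "by construction" property the abstract describes.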

A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%).
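A small sketch may help show how an additive log-space Gaussian prior breaks the symmetry among part queries. The helper names, the diagonal covariance, the normalized coordinates, and the toy keypoints below are assumptions made for illustration; the abstract does not give the parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_centers(grid):
    """Normalized (x, y) centers of a grid x grid patch layout, shape (grid*grid, 2)."""
    c = (np.arange(grid) + 0.5) / grid
    ys, xs = np.meshgrid(c, c, indexing="ij")   # row-major: index = row * grid + col
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

def init_means(keypoints, visible):
    """Dataset-average keypoint location per part.

    keypoints: (M, K, 2) normalized coordinates, visible: (M, K) booleans.
    Returns (K, 2); parts never visible fall back to (0, 0).
    """
    w = visible[..., None].astype(float)
    return (keypoints * w).sum(axis=0) / np.maximum(w.sum(axis=0), 1.0)

def log_gaussian_prior(centers, mu, sigma):
    """Unnormalized log density of a diagonal 2D Gaussian per part, shape (K, N)."""
    z = (centers[None, :, :] - mu[:, None, :]) / sigma[:, None, :]
    return -0.5 * (z ** 2).sum(axis=-1)

# Toy data: 3 annotated images, 2 parts, a 4 x 4 patch grid.
keypoints = np.array([
    [[0.20, 0.20], [0.80, 0.80]],
    [[0.25, 0.15], [0.75, 0.85]],
    [[0.15, 0.25], [0.85, 0.75]],
])
visible = np.ones((3, 2), dtype=bool)
mu = init_means(keypoints, visible)             # initial means; learnable in training
sigma = np.full((2, 2), 0.2)                    # hypothetical initial widths

centers = patch_centers(4)
content_logits = np.zeros((2, 16))              # identical queries: fully symmetric
no_prior = softmax(content_logits, axis=-1)
with_prior = softmax(content_logits + log_gaussian_prior(centers, mu, sigma), axis=-1)
peaks = with_prior.argmax(axis=1)
```

With identical content logits the two attention maps are indistinguishable, whereas adding the log prior pulls each part toward its own average location. Only the dataset-level means are needed here, so no keypoints are consumed per image at training or test time.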

Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
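The abstract does not define its pointing-accuracy protocol. One common way to score pointing, shown here purely as an assumed illustration, is to take the peak of each part's attention map and count a hit when it lands within a tolerance of the annotated keypoint. The function name `pointing_accuracy`, the normalized coordinates, and the tolerance value are hypothetical.

```python
import numpy as np

def pointing_accuracy(attn, keypoints, visible, grid, tol):
    """Fraction of visible parts whose attention peak is near the keypoint.

    attn: (M, K, N) attention over N = grid * grid patches (row-major),
    keypoints: (M, K, 2) normalized (x, y), visible: (M, K) booleans,
    tol: maximum normalized distance that still counts as a hit (assumed).
    """
    idx = attn.argmax(axis=-1)                          # (M, K) peak patch index
    rows, cols = np.divmod(idx, grid)
    peaks = np.stack([(cols + 0.5) / grid, (rows + 0.5) / grid], axis=-1)
    dist = np.linalg.norm(peaks - keypoints, axis=-1)   # (M, K)
    hits = (dist <= tol) & visible
    return hits.sum() / max(visible.sum(), 1)

# Toy example: one image, two parts, a 4 x 4 patch grid.
attn = np.zeros((1, 2, 16))
attn[0, 0, 0] = 1.0      # part 0 peaks at the top-left patch
attn[0, 1, 15] = 1.0     # part 1 peaks at the bottom-right patch
keypoints = np.array([[[0.10, 0.10], [0.20, 0.20]]])
visible = np.ones((1, 2), dtype=bool)

acc = pointing_accuracy(attn, keypoints, visible, grid=4, tol=0.15)
```

In this toy case part 0 points at its keypoint and part 1 does not, so half of the visible parts count as hits. Under a metric of this kind, a concept head that attends to the wrong body region lowers the score even when the class prediction is correct.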
