CoGeoAD: Hierarchical Color-Geometric Fusion with Multi-View Attention for Zero-Shot 3D Anomaly Detection
Quick Take
CoGeoAD is a unified CLIP-based framework for zero-shot 3D anomaly detection that fuses 2D color images with 3D geometric structure through pixel-aligned paired multi-view images. Its Data-Driven Multi-View Attention (MVA) mechanism adaptively aggregates 3D features, and its Multi-Stage Color-Geometric Fusion (MS-CGF) module hierarchically integrates multi-level features from both modalities. The authors report state-of-the-art performance on the MVTec3D-AD and Eyecandies benchmarks, covering both surface and structural defects in industrial quality inspection.
Key Points
- CoGeoAD fuses 2D color and 3D geometric features in a single CLIP-based framework for zero-shot 3D anomaly detection.
- A Data-Driven Multi-View Attention (MVA) mechanism adaptively aggregates 3D features across views.
- The authors report state-of-the-art results on the MVTec3D-AD and Eyecandies benchmarks.
- The zero-shot setting targets industrial inspection, where labeled anomaly samples are scarce.
- Source code available at https://github.com/kingdomShu/CoGeoAD.
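The idea of adaptively aggregating features across views can be illustrated with a generic attention-pooling sketch. This is not the paper's MVA module: the function names, the toy feature vectors, and the `query` vector (a stand-in for whatever data-driven signal the authors use) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_features, query):
    """Attention-pool per-view feature vectors into one fused vector.

    view_features: list of equal-length feature vectors, one per rendered view.
    query: hypothetical vector that decides how much each view matters.
    """
    dim = len(query)
    # Scaled dot-product score between the query and each view's feature.
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in view_features
    ]
    weights = softmax(scores)
    # Weighted sum of the view features, one output value per dimension.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, view_features))
        for i in range(dim)
    ]
    return fused, weights


# Toy example: three views of the same surface point (values are made up).
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
fused, weights = aggregate_views(views, query)
```

In this toy setup, views whose features align with the query receive larger weights, so an uninformative view contributes less to the fused feature than it would under plain averaging.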
Abstract: Zero-shot 3D anomaly detection is essential for industrial quality inspection, where labeled anomaly samples are scarce. Meanwhile, existing methods lack an effective mechanism to fuse complementary 2D color images with 3D geometric structures, limiting their ability to detect both surface and structural defects in a unified framework. To address these issues, we propose CoGeoAD, a unified CLIP-based framework that fuses color and geometric features by constructing pixel-aligned paired multi-view images. The framework introduces a Data-Driven Multi-View Attention (MVA) mechanism to adaptively aggregate 3D features and a Multi-Stage Color-Geometric Fusion (MS-CGF) module to hierarchically integrate multi-level features from both modalities. Extensive experiments on the MVTec3D-AD and Eyecandies benchmarks demonstrate that CoGeoAD achieves state-of-the-art performance, effectively capturing both structural and textural anomalies in complex industrial scenarios. Our source code is available at https://github.com/kingdomShu/CoGeoAD.
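To make the zero-shot, CLIP-style scoring idea concrete, the following sketch compares fused color and geometry features against two text embeddings ("normal" and "anomalous") at several stages and averages the resulting probabilities. It is a minimal illustration under stated assumptions, not the authors' MS-CGF module: the simple averaging fusion, the `temperature` value, the function names, and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anomaly_probability(feature, normal_text, anomalous_text, temperature=0.07):
    """Softmax over similarities to the 'normal' and 'anomalous' prompts."""
    s_normal = cosine(feature, normal_text) / temperature
    s_anomalous = cosine(feature, anomalous_text) / temperature
    peak = max(s_normal, s_anomalous)
    e_normal = math.exp(s_normal - peak)
    e_anomalous = math.exp(s_anomalous - peak)
    return e_anomalous / (e_normal + e_anomalous)


def fused_score(color_stages, geometry_stages, normal_text, anomalous_text):
    """Average the anomaly probability over stages after fusing modalities.

    Fusion here is a plain element-wise mean (an assumption for illustration).
    """
    scores = []
    for color, geometry in zip(color_stages, geometry_stages):
        fused = [(c + g) / 2 for c, g in zip(color, geometry)]
        scores.append(anomaly_probability(fused, normal_text, anomalous_text))
    return sum(scores) / len(scores)


# Toy embeddings (made up): axis 0 ~ "normal", axis 1 ~ "anomalous".
normal_text, anomalous_text = [1.0, 0.0], [0.0, 1.0]
color = [[0.9, 0.1], [0.8, 0.2]]          # texture looks fine at both stages
good_geometry = [[1.0, 0.0], [0.9, 0.1]]  # shape looks fine
bent_geometry = [[0.0, 1.0], [0.1, 0.9]]  # structural defect

healthy = fused_score(color, good_geometry, normal_text, anomalous_text)
defective = fused_score(color, bent_geometry, normal_text, anomalous_text)
```

The toy case mirrors the motivation in the abstract: the color features alone look normal, and only the geometric features push the fused score toward "anomalous".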
| Comments: | ICML 2026 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2606.25273 [cs.CV] (or arXiv:2606.25273v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.25273 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Ke Xu
[v1] Wed, 24 Jun 2026 01:12:22 UTC (4,521 KB)
Originally published at arxiv.org