Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline
Quick Answer
This paper introduces the Multi-Modal Industrial Open Dataset (MMIO), with over 80K samples for zero-shot industrial defect detection, and a Refined Text-Visual Prompt (RTVP) method that reaches state-of-the-art results on MMIO: 42.2% AP in zero-shot scenes and 24.7% AP in closed scenes.
Quick Take
MMIO is a new open dataset of more than 80K samples for zero-shot industrial defect detection. The accompanying Refined Text-Visual Prompt (RTVP) method adapts large models to industrial scenes, improves joint visual-textual understanding, and reaches state-of-the-art results on MMIO: 42.2% AP in zero-shot scenes and 24.7% AP in closed scenes.
Key Points
- MMIO is the first large-scale, multi-scene pre-training dataset for industrial zero-shot learning.
- RTVP improves large model generalization through expert-guided domain adaptation.
- The dataset spans 6 super categories and 18 subcategories.
- RTVP generates visual prompts directly from images and models text-visual prompt interactions.
- RTVP reaches state-of-the-art results on MMIO: 42.2% AP in zero-shot scenes and 24.7% AP in closed scenes.
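The AP figures reported above are average-precision scores. The source does not say which AP variant is used, so the following is only a minimal sketch of one common definition (non-interpolated AP over detections ranked by confidence). The function name and its inputs are illustrative assumptions, not the paper's evaluation code.

```python
def average_precision(scores, is_true_positive, n_ground_truth):
    """Non-interpolated average precision for ranked detections.

    scores           -- confidence of each detection (hypothetical input)
    is_true_positive -- whether each detection matched a ground-truth defect
    n_ground_truth   -- total number of ground-truth defects
    """
    if n_ground_truth == 0:
        return 0.0

    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    true_positives = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if is_true_positive[idx]:
            true_positives += 1
            # Precision measured at the rank of each correct detection.
            precision_sum += true_positives / rank

    # Missed ground-truth defects contribute zero precision.
    return precision_sum / n_ground_truth
```

For example, with four detections ranked correct, wrong, correct, wrong and two ground-truth defects, the precisions at the two correct ranks are 1 and 2/3, so the sketch returns their mean, about 0.83.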
Article Content
From source RSS / original summary
arXiv:2606.07965v1
Abstract: Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industrial and natural scenes make applying LVLMs challenging. Existing LVLMs rely on user-provided prompts to segment objects, which often leads to suboptimal performance because irrelevant pixels are included. In addition, data scarcity has left the application of LVLMs in industrial scenarios unexplored.
To fill this gap, this paper proposes an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. First, this paper constructs the Multi-Modal Industrial Open Dataset (MMIO) containing 80K+ samples. MMIO contains diverse industrial categories, including 6 super categories and 18 subcategories.
MMIO is the first large-scale multi-scene pre-training dataset for industrial zero-shot learning, and provides valuable training data for open models in future industrial scenarios. Based on MMIO, this paper proposes an RTVP designed specifically for industrial zero-shot tasks.
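In a zero-shot setting, the categories used for evaluation are kept out of training. The source does not describe how MMIO is split, so the following is a minimal sketch under stated assumptions: the category names are placeholders, the 18 subcategories are assumed to be spread evenly over the 6 super categories, and the helper names are hypothetical.

```python
import random


def build_placeholder_taxonomy(n_super=6, n_sub_total=18):
    """Build a stand-in taxonomy; real MMIO category names are not given in the source."""
    per_super = n_sub_total // n_super  # even split is an assumption
    return {
        f"super_{s}": [f"super_{s}/sub_{k}" for k in range(per_super)]
        for s in range(n_super)
    }


def zero_shot_split(taxonomy, n_unseen, seed=0):
    """Hold out n_unseen subcategories so they never appear during training."""
    subcategories = sorted(sub for subs in taxonomy.values() for sub in subs)
    rng = random.Random(seed)  # seeded for a reproducible split
    unseen = set(rng.sample(subcategories, n_unseen))
    seen = [sub for sub in subcategories if sub not in unseen]
    return seen, sorted(unseen)


def filter_samples(samples, allowed_categories):
    """Keep only samples whose 'category' field is in the allowed set."""
    allowed = set(allowed_categories)
    return [sample for sample in samples if sample["category"] in allowed]


taxonomy = build_placeholder_taxonomy()
seen, unseen = zero_shot_split(taxonomy, n_unseen=4)
```

Training data would then be `filter_samples(all_samples, seen)` and zero-shot evaluation data `filter_samples(all_samples, unseen)`, where `all_samples` is a hypothetical list of dictionaries with a `category` key.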
RTVP has two significant advantages. First, it includes an expert-guided domain adaptation mechanism for large models and an industrial zero-shot method based on Mobile-SAM, which enhances the generalization ability of large models in industrial scenarios. Second, it automatically generates visual prompts directly from images and accounts for the text-visual prompt interactions ignored by previous LVLMs, improving understanding of visual and textual content. RTVP achieves SOTA on MMIO with 42.2% AP in zero-shot scenes and 24.7% AP in closed scenes.
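RTVP is described as generating visual prompts directly from images instead of relying on user-provided ones. The source gives no details of that mechanism, so the sketch below is not the paper's method. It only illustrates the general idea with a simple stand-in: thresholding a per-pixel score map (a hypothetical input, for example a coarse defect-likelihood map) to derive a box prompt and a point prompt of the kind a SAM-style segmenter accepts.

```python
def box_prompt_from_scores(score_map, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) around pixels scoring above threshold.

    score_map is a list of rows of floats; returns None if no pixel qualifies.
    """
    coords = [
        (x, y)
        for y, row in enumerate(score_map)
        for x, value in enumerate(row)
        if value > threshold
    ]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def point_prompt_from_scores(score_map):
    """Return the (x, y) position of the highest-scoring pixel."""
    best = max(
        ((value, x, y) for y, row in enumerate(score_map) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return (best[1], best[2])


# Toy 4x4 score map with a small high-scoring region in the middle.
score_map = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.9, 0.0],
    [0.0, 0.7, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box = box_prompt_from_scores(score_map)      # (1, 1, 2, 2)
point = point_prompt_from_scores(score_map)  # (2, 1)
```

A tight box derived from the image itself excludes most background pixels, which is the problem the abstract attributes to user-provided prompts.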