GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation
Quick Take
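GAP3D is a modular, diffusion-based method that aligns vision-language model (VLM) latents to the patch-level embeddings of a pre-trained image encoder, so that a frozen 3D generative model can use a VLM as its prompt encoder. It trains mainly on general-domain image-text pairs rather than large-scale 3D data, and it shows emergent zero-shot handling of multimodal prompts despite text-only training. The authors present it as a first step towards modular integration of foundation models, and note that it currently favors high-level semantics over fine-grained detail.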
Key Points
- GAP3D aligns VLM latents to patch-level features for improved spatial conditioning.
- Training relies mainly on general-domain image-text pairs, which removes the need for large-scale 3D data.
- Demonstrates zero-shot capabilities for multimodal prompts despite text-only training.
- Currently prioritizes high-level semantics over fine-grained detail, so the gap between VLM and image-encoder feature spaces is only partially bridged.
- Represents a modular approach to integrating foundation models in generative tasks.
Article Content
From source RSS / original summary
arXiv:2605.28995v1 Announce Type: new
Abstract: Recent approaches integrating vision-language models (VLMs) as prompt encoders for generative model conditioning typically rely on expensive end-to-end training or map features to compressed representations, discarding the dense spatial structure required for geometry-aware tasks like 3D asset generation.
To address this, we propose GAP3D, a modular, diffusion-based approach that aligns VLM-generated latents directly to the complete, patch-level feature space of a pre-trained image encoder, enabling a frozen downstream generative model to utilize a VLM as prompt encoder while maintaining a spatially structured conditioning signal. Evaluated on 3D asset generation, our method bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs.
It also exhibits emergent zero-shot capabilities for multimodal prompts, despite being trained exclusively on text input. Finally, while currently prioritizing high-level semantics over fine-grained detail, GAP3D demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment, taking the first steps towards a modular integration of foundation models through generative alignment to dense embedding spaces.
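The abstract does not state the training objective, so the following is only a minimal sketch of what diffusion-based alignment to patch-level embeddings could look like, assuming a standard DDPM-style noise-prediction loss. The names `make_alpha_bar`, `add_noise`, `ToyDenoiser`, and `alignment_loss` are hypothetical, the linear denoiser is a toy stand-in rather than the authors' network, and random arrays replace real VLM latents and image-encoder patch embeddings.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps


class ToyDenoiser:
    """Hypothetical stand-in for the alignment network (linear, untrained)."""

    def __init__(self, patch_dim, vlm_dim, rng):
        self.w_x = rng.standard_normal((patch_dim, patch_dim)) * 0.01
        self.w_c = rng.standard_normal((vlm_dim, patch_dim)) * 0.01
        self.w_t = rng.standard_normal(patch_dim) * 0.01

    def __call__(self, x_t, t_frac, vlm_latents):
        # Pool the VLM token latents into one conditioning vector (an assumption;
        # a real model would more likely use cross-attention over the tokens).
        cond = vlm_latents.mean(axis=0)
        # Output keeps the full (num_patches, patch_dim) spatial layout.
        return x_t @ self.w_x + cond @ self.w_c + t_frac * self.w_t


def alignment_loss(denoiser, patch_emb, vlm_latents, t, alpha_bar, rng):
    """Mean squared error between the predicted and the true noise."""
    x_t, eps = add_noise(patch_emb, t, alpha_bar, rng)
    eps_hat = denoiser(x_t, t / len(alpha_bar), vlm_latents)
    return float(np.mean((eps_hat - eps) ** 2))


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
patch_emb = rng.standard_normal((16, 8))     # 16 patches, 8-dim features (toy sizes)
vlm_latents = rng.standard_normal((5, 12))   # 5 prompt tokens, 12-dim latents (toy sizes)
model = ToyDenoiser(patch_dim=8, vlm_dim=12, rng=rng)
loss = alignment_loss(model, patch_emb, vlm_latents, t=500, alpha_bar=alpha_bar, rng=rng)
print(f"noise-prediction loss at t=500: {loss:.4f}")
```

The point of the sketch is that the diffusion target is the complete grid of patch embeddings rather than a single pooled vector, which is what would let a frozen downstream generator keep receiving a spatially structured conditioning signal.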