GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation

arXiv cs.CV·Polytimi Anna Gkotsi, Andrii Zadaianchuk, Mohammad Mahdi Derakhshani

5/29/2026

Quick Take

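GAP3D is a modular, diffusion-based method that aligns vision-language model (VLM) latents to the patch-level feature space of a pre-trained image encoder, so that a frozen downstream generative model can use a VLM as its prompt encoder. Evaluated on 3D asset generation, it trains mainly on general-domain image-text pairs rather than large-scale 3D data, and it shows emergent zero-shot capabilities for multimodal prompts. The authors present it as a first step towards modular integration of foundation models and note that it currently favors high-level semantics over fine-grained detail.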

Key Points

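GAP3D aligns VLM latents to the complete patch-level feature space of a pre-trained image encoder, preserving a spatially structured conditioning signal.
It bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs.
It shows emergent zero-shot capabilities for multimodal prompts despite being trained exclusively on text input.
It currently prioritizes high-level semantics over fine-grained detail and only partially bridges the gap between VLM and image-encoder feature spaces.
Its modular design keeps the downstream generative model frozen, a first step towards integrating foundation models through generative alignment.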

Article Content

From source RSS / original summary

arXiv:2605.28995v1 Announce Type: new

Abstract: Recent approaches integrating vision-language models (VLMs) as prompt encoders for generative model conditioning typically rely on expensive end-to-end training or map features to compressed representations, discarding the dense spatial structure required for geometry-aware tasks like 3D asset generation.
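To make the point about compressed representations concrete, here is a toy sketch (not taken from the paper) of how pooling patch features into a single vector can erase spatial layout. The 2x2 grid, the 2-dimensional features, and the `mean_pool` helper are all hypothetical choices made for illustration.

```python
# Illustrative only: toy 2x2 "patch grids" with 2-dimensional features.
# Neither the grid size nor the feature values come from the paper.

def mean_pool(patches):
    """Collapse a list of patch vectors into one global vector."""
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]

# Row-major patch order: top-left, top-right, bottom-left, bottom-right.
layout_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
layout_b = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

pooled_a = mean_pool(layout_a)
pooled_b = mean_pool(layout_b)

# The two layouts are mirror images, yet pooling maps them to the same vector.
same_after_pooling = pooled_a == pooled_b
# At patch level the two layouts remain distinguishable.
same_at_patch_level = layout_a == layout_b
```

In this toy case the pooled vectors coincide while the patch-level sequences differ, which is the kind of spatial information a geometry-aware generator would lose with a compressed conditioning signal.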

To address this, we propose GAP3D, a modular, diffusion-based approach that aligns VLM-generated latents directly to the complete, patch-level feature space of a pre-trained image encoder, enabling a frozen downstream generative model to utilize a VLM as prompt encoder while maintaining a spatially structured conditioning signal. Evaluated on 3D asset generation, our method bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs.
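The abstract does not state the training objective, so the following is only a minimal sketch of what a diffusion-based alignment step could look like, assuming a standard DDPM-style noise-prediction loss. The names `add_noise`, `alignment_loss`, `zero_denoiser`, and the toy shapes are hypothetical; the targets stand in for patch embeddings from a frozen image encoder and the condition stands in for VLM latents.

```python
import math
import random

def add_noise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * x + b * e for x, e in zip(xr, er)] for xr, er in zip(x0, eps)]

def mse(pred, target):
    """Mean squared error over every entry of a patches-by-dims grid."""
    n = sum(len(row) for row in target)
    return sum((p - t) ** 2
               for pr, tr in zip(pred, target)
               for p, t in zip(pr, tr)) / n

def alignment_loss(denoiser, patch_targets, vlm_latents, alpha_bar, rng):
    """One training step (assumed objective): noise the patch-level targets,
    then ask the denoiser, conditioned on VLM latents, to predict that noise."""
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in patch_targets]
    x_t = add_noise(patch_targets, eps, alpha_bar)
    eps_pred = denoiser(x_t, vlm_latents, alpha_bar)
    return mse(eps_pred, eps)

def zero_denoiser(x_t, cond, alpha_bar):
    """Placeholder for a learned network: always predicts zero noise."""
    return [[0.0 for _ in row] for row in x_t]

# Toy data: 4 patches with 3 dims each, and 2 VLM latent vectors as the condition.
patch_targets = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5], [0.2, 0.2, 0.2], [-0.1, 0.4, 0.0]]
vlm_latents = [[0.3, 0.7], [0.9, -0.1]]

loss = alignment_loss(zero_denoiser, patch_targets, vlm_latents, 0.5, random.Random(0))
```

In such a setup only the denoiser would be trained; the image encoder that produces the targets and the downstream generator that consumes the aligned patch embeddings would stay frozen, which matches the modular framing in the abstract.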

It also exhibits emergent zero-shot capabilities for multimodal prompts, despite being trained exclusively on text input. Finally, while currently prioritizing high-level semantics over fine-grained detail, GAP3D demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment, taking the first steps towards a modular integration of foundation models through generative alignment to dense embedding spaces.

