CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
Quick Take
CrossVLA studies post-training and inference optimization for vision-language-action (VLA) models across paradigms, targeting both task performance and inference efficiency.
Key Points
- Introduces a log-probability estimator for continuous-action DPO.
- DoRA outperforms OpenVLA SFT by 10.4 percentage points on average.
- Inference analysis reveals latency bottlenecks in denoising loops.
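
To make the first key point concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not CrossVLA's method: this brief does not describe how the paper estimates log-probabilities for continuous actions, so the sketch simply assumes those estimates are already available as plain floats. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are all hypothetical choices made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) action pair.

    All four inputs are log-probabilities (floats). How they are
    estimated for continuous actions is NOT covered here; that is
    the part the paper's estimator is said to provide.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen action than the reference model does, scaled by beta.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # loss = -log(sigmoid(margin)) = softplus(-margin),
    # written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical numbers: the policy favors the chosen action more
# strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
                ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss decreases as the policy shifts probability toward the chosen action relative to the reference.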