LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment
Quick Answer
LAST introduces a Gromov-Wasserstein approach to align Vision-Language and Action manifolds, overcoming their mathematical heterogeneity.
Quick Take
LAST (Lie-algebraic Action Space Tokenizer) reconstructs the action space through a two-stage transformation: Global Topological Linearization followed by Local Metric Discretization. The aim is to make the local metric of action representations compatible with that of VL embeddings, which the authors report leads to better convergence and generalizability in VLA models.
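For readers unfamiliar with the Gromov-Wasserstein framing, the standard discrete Gromov-Wasserstein discrepancy with squared loss is written out here. This is the textbook definition, not necessarily the exact objective used in the paper, and the symbols are assumptions made for illustration: $C^{X}$ holds pairwise distances among $n$ VL embeddings, $C^{Y}$ holds pairwise distances among $m$ action representations, and $p$, $q$ are probability weights on the two sets.

```latex
\mathrm{GW}(C^{X}, C^{Y}, p, q)
  = \min_{T \in \Pi(p, q)}
    \sum_{i,k=1}^{n} \sum_{j,l=1}^{m}
    \left| C^{X}_{ik} - C^{Y}_{jl} \right|^{2} T_{ij}\, T_{kl},
\qquad
\Pi(p, q) = \left\{ T \in \mathbb{R}_{\ge 0}^{n \times m} :
    T \mathbf{1}_{m} = p,\; T^{\top} \mathbf{1}_{n} = q \right\}
```

The objective compares distances within each space rather than points across spaces, which is why it suits two domains that do not share a common coordinate system.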
Key Points
- LAST uses Lie-algebraic mapping for global topological linearization of action manifolds.
- LAST converts action trajectories into a fixed-length, physically additive representation.
- Local metric discretization yields approximately isotropic local charts aligned with semantic metrics.
- LAST resolves structural mismatches at both global and local levels.
- The authors report superior convergence and generalizability for VLA models that use LAST.
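The Lie-algebraic mapping in the first key point can be illustrated with the rotation group SO(3). The sketch below is not the paper's implementation: the function names `so3_exp` and `so3_log` and the toy trajectory are hypothetical, and only rotations are handled. It shows how the logarithm map turns rotation matrices into ordinary 3-vectors, and that for rotations about a shared axis those vectors add up to the total rotation.

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Logarithm map: rotation matrix -> axis-angle vector (valid for angles below pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros(3)
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * v

# Hypothetical toy trajectory: three successive rotations about the z-axis.
steps = [np.array([0.0, 0.0, 0.1]),
         np.array([0.0, 0.0, 0.2]),
         np.array([0.0, 0.0, 0.3])]
poses = [np.eye(3)]
for w in steps:
    poses.append(poses[-1] @ so3_exp(w))

# Relative motions mapped into the Lie algebra become plain vectors.
increments = [so3_log(poses[i].T @ poses[i + 1]) for i in range(len(steps))]
total = np.sum(increments, axis=0)
direct = so3_log(poses[-1])
```

Here `total` and `direct` agree because all steps share one rotation axis. For rotations about different axes, summing the log vectors is only an approximation, so the abstract's "physically additive" property should be read from the paper itself rather than from this sketch.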
Article Content
From source RSS / original summary
arXiv:2606.11221v1 (Announce Type: new)
Abstract: We take a Gromov-Wasserstein perspective on Vision-Language-Action (VLA) learning, where the goal is to make the relational geometry of action representations compatible with the semantic geometry of VL embeddings. However, this alignment is non-trivial due to the mathematical heterogeneity between the domains: the semantic space of vision-language is topologically linear and isotropic, whereas the physical manifold of robotic action is non-Euclidean and anisotropic. Their disjoint metric structures render direct regression ill-posed. To resolve this incompatibility, we introduce LAST (Lie-algebraic Action Space Tokenizer), which reconstructs the action space to establish local metric compatibility with the VL modality via a two-stage transformation:
(1) Global Topological Linearization: linearizing the action manifold via Lie-algebraic mapping, converting trajectories into a fixed-length, physically additive representation.
(2) Local Metric Discretization: hierarchically discretizing the representation into schemas and whitened residuals, yielding approximately isotropic local charts that are statistically aligned with the semantic metric.
By resolving the structural mismatch at both global and local levels, LAST enables VLA models with superior convergence and generalizability.
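The second stage in the abstract, discretizing into schemas and whitened residuals, can be sketched as a coarse nearest-centroid assignment followed by per-cluster whitening. This is a minimal illustration under stated assumptions, not the paper's algorithm: the codebook `centroids`, the helper names `assign_schemas` and `whiten`, the ZCA whitening choice, and the synthetic data are all hypothetical.

```python
import numpy as np

def assign_schemas(X, centroids):
    """Return, for each row of X, the index of the nearest centroid (its 'schema')."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def whiten(residuals, eps=1e-8):
    """ZCA-whiten residuals so their sample covariance is approximately the identity."""
    centered = residuals - residuals.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return centered @ W

# Synthetic anisotropic data: two clusters whose axes have very different scales.
rng = np.random.default_rng(0)
scales = np.array([10.0, 1.0, 0.1])
centroids = np.array([[0.0, 0.0, 0.0],
                      [100.0, 0.0, 0.0]])  # hypothetical codebook, assumed given
X = np.concatenate([c + rng.normal(size=(300, 3)) * scales for c in centroids])

ids = assign_schemas(X, centroids)
whitened = np.empty_like(X)
for k in range(len(centroids)):
    mask = ids == k
    # Residual = sample minus its schema centroid, then whitened within the cluster.
    whitened[mask] = whiten(X[mask] - centroids[k])
```

After whitening, each cluster's residuals have roughly unit variance in every direction, which is one concrete reading of "approximately isotropic local charts". In practice the codebook would be learned, for example with k-means, rather than fixed by hand as it is here.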
More from arXiv cs.CV
LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval
A phase-aware LLM agent optimizes human-object interaction retrieval, outperforming Optuna TPE by 33.3% and VDTuner by 34.2% on the HICO-DET benchmark. This method enhances throughput by 15.3x over UniIR and demonstrates strong transferability across vector database management systems.