LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment

arXiv cs.CV·Huaihai Lyu, Chaofan Chen, Yuheng Ji, Xiansheng Chen, Pengwei Wang, Shanghang Zhang, Changsheng Xu

6/11/2026

Quick Answer

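LAST (Lie-algebraic Action Space Tokenizer) takes a Gromov-Wasserstein view of Vision-Language-Action (VLA) learning: it reshapes the action representation so that its relational geometry is compatible with the semantic geometry of vision-language embeddings.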

Quick Take

The obstacle is a geometric mismatch: the vision-language semantic space is described as topologically linear and isotropic, while the manifold of robot actions is non-Euclidean and anisotropic. LAST addresses this with a two-stage transformation, Global Topological Linearization followed by Local Metric Discretization, which the authors report gives VLA models better convergence and generalizability.
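To make the Gromov-Wasserstein idea concrete, here is a minimal sketch of the quantity such an alignment compares: pairwise distances inside one space against pairwise distances inside another, under a fixed coupling between points. It is not the paper's objective or training loss. The function names `pairwise_dist` and `gw_distortion`, the uniform one-to-one coupling, and the random toy data are all assumptions made for illustration.

```python
import numpy as np

def pairwise_dist(x):
    """Euclidean distance matrix between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def gw_distortion(d_a, d_b, coupling):
    """Squared-loss Gromov-Wasserstein distortion for a given coupling T:
    sum over i,j,k,l of (d_a[i,k] - d_b[j,l])^2 * T[i,j] * T[k,l].
    Brute force over four indices, so only suitable for small toy inputs."""
    diff = d_a[:, None, :, None] - d_b[None, :, None, :]  # indexed [i, j, k, l]
    return float(np.einsum("ijkl,ij,kl->", diff ** 2, coupling, coupling))

rng = np.random.default_rng(0)
vl = rng.normal(size=(5, 8))                 # stand-in for 5 VL embeddings in R^8
same_geometry = np.hstack([vl, np.zeros((5, 2))])  # isometric copy living in R^10
stretched = vl * np.linspace(0.2, 5.0, 8)    # anisotropically rescaled copy
coupling = np.eye(5) / 5                     # hypothetical one-to-one matching

# Isometric copy: distortion is zero up to rounding, despite the different dimension.
print(gw_distortion(pairwise_dist(vl), pairwise_dist(same_geometry), coupling))
# Anisotropic copy: relational geometry no longer matches, so distortion is positive.
print(gw_distortion(pairwise_dist(vl), pairwise_dist(stretched), coupling))
```

The point of the sketch is that only within-space distances are compared, so the two spaces need not share a dimension or a coordinate system. A full Gromov-Wasserstein solver would also optimize the coupling, which this sketch leaves fixed.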

Key Points

LAST linearizes the action manifold globally through a Lie-algebraic mapping.
This mapping converts action trajectories into a fixed-length, physically additive representation.
Local metric discretization splits that representation hierarchically into schemas and whitened residuals, yielding approximately isotropic local charts that are statistically aligned with the semantic metric.
Together, the two stages address the structural mismatch at both the global and local levels.
The authors report superior convergence and generalizability for VLA models trained with LAST.
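As an illustration of what a Lie-algebraic mapping can look like, the sketch below implements the standard logarithm and exponential maps for 3D rotations (the group SO(3)). The summary does not say which Lie group or parameterization LAST uses, so treating actions as rotations is an assumption, and the helper names `so3_exp` and `so3_log` are hypothetical.

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector w in R^3 -> 3x3 rotation matrix."""
    theta = np.linalg.norm(w)
    K = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near the identity
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta ** 2) * (K @ K))

def so3_log(R):
    """Rotation matrix -> axis-angle vector (valid for rotation angles below pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return 0.5 * v
    return theta / (2.0 * np.sin(theta)) * v

w = np.array([0.3, -0.2, 0.5])
print(so3_log(so3_exp(w)))            # round trip recovers w up to rounding

# Two rotations about the same axis compose by plain vector addition in the algebra.
a, b = np.array([0.0, 0.0, 0.2]), np.array([0.0, 0.0, 0.3])
print(so3_log(so3_exp(a) @ so3_exp(b)), a + b)
```

In the rotation group, composition is matrix multiplication; in the Lie algebra, elements are ordinary vectors. For rotations about a shared axis the composition becomes exact addition, while for differing axes addition holds only to first order. That is one plausible reading of "physically additive", not a claim about the paper's construction.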


Article Content

From source RSS / original summary

arXiv:2606.11221v1

Abstract: We take a Gromov-Wasserstein perspective on Vision-Language-Action (VLA) learning, where the goal is to make the relational geometry of action representations compatible with the semantic geometry of VL embeddings. However, this alignment is non-trivial due to the mathematical heterogeneity between the domains: the semantic space of vision-language is topologically linear and isotropic, whereas the physical manifold of robotic action is non-Euclidean and anisotropic. Their disjoint metric structures render direct regression ill-posed.

To resolve this incompatibility, we introduce LAST (Lie-algebraic Action Space Tokenizer), which reconstructs the action space to establish local metric compatibility with the VL modality via a two-stage transformation:

(1) Global Topological Linearization: linearizing the action manifold via Lie-algebraic mapping, converting trajectories into a fixed-length, physically additive representation.

(2) Local Metric Discretization: hierarchically discretizing the representation into schemas and whitened residuals, yielding approximately isotropic local charts that are statistically aligned with the semantic metric.

By resolving the structural mismatch at both global and local levels, LAST enables VLA models with superior convergence and generalizability.
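The second stage can be pictured with a small sketch: assign each vector to its nearest codebook entry, then whiten the leftover residuals so their covariance is close to the identity, which is what "approximately isotropic" means statistically. This is only an illustration under stated assumptions. Reading "schemas" as codebook entries, using a single-level codebook picked from the data, whitening globally with ZCA, and the names `assign_schemas` and `whiten` are all hypothetical choices, not details from the abstract.

```python
import numpy as np

def assign_schemas(x, codebook):
    """Index of the nearest codebook entry (a stand-in for a 'schema') per row of x."""
    dists = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def whiten(residuals, eps=1e-8):
    """ZCA whitening: center, then rescale so the sample covariance is ~identity."""
    centered = residuals - residuals.mean(axis=0)
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)
    transform = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ transform

rng = np.random.default_rng(0)
# Toy anisotropic data: very different spread along each axis.
x = rng.normal(size=(2000, 3)) * np.array([5.0, 1.0, 0.1])
codebook = x[rng.choice(len(x), size=4, replace=False)]  # placeholder for a learned codebook

idx = assign_schemas(x, codebook)
residuals = x - codebook[idx]
white = whiten(residuals)

print(np.round(residuals.T @ residuals / len(residuals), 3))  # strongly anisotropic
print(np.round(white.T @ white / len(white), 3))              # close to the identity
```

A hierarchical scheme would repeat the assign-and-subtract step over several codebook levels before whitening; the sketch keeps one level to stay short.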

