EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control
Quick Answer
This paper shows that the Electro-Visual-Language Assistant (EVLA) improves driving decision-making by combining real-time vehicle-state awareness with multimodal scene understanding. On a driving QA benchmark it outperforms strong fine-tuned VLM baselines by +0.0871 in final score and +5.6% in accuracy.
Quick Take
EVLA feeds the real-time state of the electrified powertrain (for example motor torque and battery SOC) into a multimodal driving assistant alongside visual and textual inputs. Its Unified Co-State Encoder (UCSE) and Electro-aware Structured Reasoning Chain (ESRC) yield gains of +0.0871 in final score and +5.6% in accuracy over fine-tuned VLM baselines, and the authors report 36% faster inference than multi-stage pipelines.
Key Points
- EVLA combines visual, textual, and vehicle state inputs for enhanced decision-making.
- Improves the final score by +0.0871 and accuracy by +5.6% over strong fine-tuned VLM baselines on a driving QA benchmark.
- Utilizes a Unified Co-State Encoder for shared latent representation.
- Features an Electro-aware Structured Reasoning Chain for deterministic reasoning.
- Delivers 36% faster inference than multi-stage pipelines.
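The "36% faster inference" figure can be read in two ways, and the summary does not say which one the authors mean: either latency drops by 36%, or throughput rises by 36%. The short sketch below works through both readings. The 100 ms baseline latency is a hypothetical number chosen only to make the arithmetic concrete; it does not come from the paper.

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Speedup factor when latency drops by `reduction` (0.36 means 36% less time)."""
    return 1.0 / (1.0 - reduction)


def latency_reduction_from_throughput_gain(gain: float) -> float:
    """Fractional latency drop when throughput rises by `gain` (0.36 means 36% more work per second)."""
    return 1.0 - 1.0 / (1.0 + gain)


# Hypothetical baseline latency of a multi-stage pipeline, in milliseconds.
baseline_ms = 100.0

# Reading A: "36% faster" means 36% less time per query.
reading_a_ms = baseline_ms * (1.0 - 0.36)

# Reading B: "36% faster" means 36% more queries per second.
reading_b_ms = baseline_ms / (1.0 + 0.36)

print(f"Reading A: {reading_a_ms:.1f} ms, speedup x{speedup_from_latency_reduction(0.36):.2f}")
print(f"Reading B: {reading_b_ms:.1f} ms, latency cut {latency_reduction_from_throughput_gain(0.36):.1%}")
```

Under reading A the speedup factor is about 1.56x; under reading B latency falls by only about 26%. The full paper would be needed to tell which definition applies.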
Paper Resources
Abstract: Modern vision-language models (VLMs) for driving assistants typically treat vehicle dynamics as a black box, resulting in decisions that lack awareness of the vehicle's real-time electro-mechanical state. To bridge this gap, we introduce the Electro-Visual-Language Assistant (EVLA), a novel framework that combines multi-modal scene understanding with real-time perception of the electrified powertrain state (e.g., motor torque, battery SOC). Our approach features two key innovations: first, a Unified Co-State Encoder (UCSE) that fuses visual, textual, and vehicle-state inputs into a shared latent representation, augmented with an Energy-Efficiency Field to model spatial energy costs; and second, an Electro-aware Structured Reasoning Chain (ESRC), which replaces external chain-of-thought prompting with an internal, deterministic reasoning process grounded in physical constraints and optimization objectives. Trained end-to-end with a physics-guided joint loss, EVLA learns to generate context-aware and energy-optimal driving decisions. Extensive evaluations on a driving QA benchmark demonstrate that EVLA substantially outperforms strong fine-tuned VLM baselines, improving the final score by +0.0871 and accuracy by +5.6%. Ablation studies validate the necessity of each component, and efficiency analyses show that EVLA achieves 36% faster inference than multi-stage pipelines. This work underscores that integrating vehicle-state awareness and structured physical reasoning is crucial for developing next-generation, physically-grounded driving assistants.
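The abstract mentions a physics-guided joint loss but does not give its form. As a rough illustration of the general idea only, the sketch below adds a weighted penalty for violating physical limits to an ordinary answer-classification loss. Every name here (`physics_penalty`, `joint_loss`, `physics_weight`), the choice of torque and energy limits, and the squared-hinge penalty are assumptions made for this example, not the authors' formulation.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct answer option."""
    return -math.log(max(probs[target_index], 1e-12))


def physics_penalty(requested_torque_nm, max_torque_nm,
                    predicted_energy_kwh, available_energy_kwh):
    """Squared-hinge penalty: zero when both limits are respected.

    Hypothetical constraints; both limits are assumed to be positive.
    """
    torque_excess = max(0.0, abs(requested_torque_nm) - max_torque_nm) / max_torque_nm
    energy_excess = max(0.0, predicted_energy_kwh - available_energy_kwh) / available_energy_kwh
    return torque_excess ** 2 + energy_excess ** 2


def joint_loss(probs, target_index, requested_torque_nm, max_torque_nm,
               predicted_energy_kwh, available_energy_kwh, physics_weight=0.5):
    """Task loss plus a weighted physics term (the weight is an assumed hyperparameter)."""
    task = cross_entropy(probs, target_index)
    physics = physics_penalty(requested_torque_nm, max_torque_nm,
                              predicted_energy_kwh, available_energy_kwh)
    return task + physics_weight * physics


# Same answer distribution, once with a feasible and once with an infeasible torque request.
probs = [0.7, 0.2, 0.1]
feasible = joint_loss(probs, 0, 100.0, 250.0, 1.0, 10.0)
infeasible = joint_loss(probs, 0, 300.0, 250.0, 1.0, 10.0)
print(feasible, infeasible)
```

In this toy setup a decision that respects the limits is scored by the task loss alone, while one that exceeds the torque limit is scored strictly worse, which is the qualitative behavior a physics-guided term is meant to encourage.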
| Comments: | 17 pages |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.28938 [cs.CL] |
| | (or arXiv:2606.28938v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.28938 (arXiv-issued DOI via DataCite) |
Submission history
From: Yuxin Liu
[v1] Sat, 27 Jun 2026 14:20:26 UTC (8,663 KB)
— Originally published at arxiv.org