EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control

arXiv cs.CL·Yuxin Liu, Zihan Chen, Haoyu Wang, Mingxuan Zhang, Ruijie Lin, Siyuan Zhao

~2 min read · 6/30/2026

Quick Answer

The Electro-Visual-Language Assistant (EVLA) combines multimodal scene understanding with real-time awareness of the electrified powertrain state (e.g., motor torque, battery SOC). On a driving QA benchmark it outperforms strong fine-tuned VLM baselines, improving the final score by +0.0871 and accuracy by +5.6%.

Quick Take

EVLA pairs a Unified Co-State Encoder (UCSE), which fuses visual, textual, and vehicle-state inputs into a shared latent representation, with an Electro-aware Structured Reasoning Chain (ESRC), an internal, deterministic reasoning process that replaces external chain-of-thought prompting. According to the abstract, the model beats fine-tuned VLM baselines by +0.0871 in final score and +5.6% in accuracy on a driving QA benchmark, and it runs inference 36% faster than multi-stage pipelines.
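The "36% faster" figure can be read in two ways, and the abstract does not say which one is meant. The short calculation below shows how far apart the two readings are. The 100 ms baseline latency is a made-up number chosen for easy arithmetic, not a measurement from the paper, and both function names are illustrative.

```python
def latency_if_time_reduced(baseline_ms: float, pct: float) -> float:
    """Latency when 'pct% faster' means the run time shrinks by pct percent."""
    return baseline_ms * (1.0 - pct / 100.0)


def latency_if_throughput_raised(baseline_ms: float, pct: float) -> float:
    """Latency when 'pct% faster' means throughput grows by pct percent."""
    return baseline_ms / (1.0 + pct / 100.0)


# Hypothetical multi-stage pipeline latency; NOT a figure from the paper.
baseline_ms = 100.0

reduced = latency_if_time_reduced(baseline_ms, 36.0)
raised = latency_if_throughput_raised(baseline_ms, 36.0)

print(f"run time reduced by 36%:   {reduced:.1f} ms")
print(f"throughput raised by 36%:  {raised:.1f} ms")
```

Under the first reading the hypothetical 100 ms pipeline drops to 64 ms; under the second it drops to roughly 73.5 ms. The full paper would need to be consulted to know which definition the authors use.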

Key Points

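EVLA combines visual, textual, and vehicle-state inputs (e.g., motor torque, battery SOC) when making driving decisions.
Improves the final score by +0.0871 and accuracy by +5.6% over strong fine-tuned VLM baselines on a driving QA benchmark.
Uses a Unified Co-State Encoder (UCSE) to fuse all three inputs into a shared latent representation, augmented with an Energy-Efficiency Field that models spatial energy costs.
Uses an Electro-aware Structured Reasoning Chain (ESRC), an internal, deterministic reasoning process grounded in physical constraints and optimization objectives, in place of external chain-of-thought prompting.
Achieves 36% faster inference than multi-stage pipelines.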
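To make the idea of a shared latent representation concrete, here is a minimal sketch of one common fusion pattern: project each modality to a common width, then average. The summary does not describe the encoder's actual architecture, so everything here (the `fuse` function, the dimensions, the random projection weights, the averaging step) is an assumption for illustration and not the authors' UCSE.

```python
import random


def linear(x, weights):
    """Multiply a feature vector by a weight matrix (one row per output dimension)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Build a small random projection matrix (a stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse(visual, text, state, projections):
    """Project each modality to the shared width, then average element-wise."""
    projected = [
        linear(visual, projections["visual"]),
        linear(text, projections["text"]),
        linear(state, projections["state"]),
    ]
    return [sum(vals) / len(projected) for vals in zip(*projected)]


rng = random.Random(0)
LATENT_DIM = 4  # hypothetical shared latent width

projections = {
    "visual": random_matrix(LATENT_DIM, 6, rng),
    "text": random_matrix(LATENT_DIM, 5, rng),
    "state": random_matrix(LATENT_DIM, 2, rng),
}

# Hypothetical feature vectors; the state vector holds
# normalized motor torque and battery SOC.
visual = [0.2, 0.1, 0.0, 0.5, 0.3, 0.9]
text = [0.4, 0.0, 0.1, 0.7, 0.2]
state = [0.35, 0.80]

latent = fuse(visual, text, state, projections)
print(len(latent))
```

The point of the sketch is only that three inputs of different sizes end up as one vector of a fixed width, which downstream reasoning can then consume without separate per-modality stages.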

[Submitted on 27 Jun 2026]

Abstract: Modern vision-language models (VLMs) for driving assistants typically treat vehicle dynamics as a black box, resulting in decisions that lack awareness of the vehicle's real-time electro-mechanical state. To bridge this gap, we introduce the Electro-Visual-Language Assistant (EVLA), a novel framework that combines multi-modal scene understanding with real-time perception of the electrified powertrain state (e.g., motor torque, battery SOC). Our approach features two key innovations: first, a Unified Co-State Encoder (UCSE) that fuses visual, textual, and vehicle-state inputs into a shared latent representation, augmented with an Energy-Efficiency Field to model spatial energy costs; and second, an Electro-aware Structured Reasoning Chain (ESRC), which replaces external chain-of-thought prompting with an internal, deterministic reasoning process grounded in physical constraints and optimization objectives. Trained end-to-end with a physics-guided joint loss, EVLA learns to generate context-aware and energy-optimal driving decisions. Extensive evaluations on a driving QA benchmark demonstrate that EVLA substantially outperforms strong fine-tuned VLM baselines, improving the final score by +0.0871 and accuracy by +5.6%. Ablation studies validate the necessity of each component, and efficiency analyses show that EVLA achieves 36% faster inference than multi-stage pipelines. This work underscores that integrating vehicle-state awareness and structured physical reasoning is crucial for developing next-generation, physically-grounded driving assistants.
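The abstract mentions a physics-guided joint loss but does not list its terms. One plausible shape, sketched below purely as an illustration, is a task loss plus a weighted penalty that is zero when a decision respects a physical limit and grows when it does not. The energy-budget constraint, the `weight` value, and all function names are assumptions and are not taken from the paper.

```python
import math


def cross_entropy(probs, target_index):
    """Task loss: negative log-probability assigned to the correct answer."""
    return -math.log(probs[target_index])


def physics_penalty(requested_energy_kwh, available_energy_kwh):
    """Hinge penalty: zero when feasible, grows linearly with the shortfall."""
    return max(0.0, requested_energy_kwh - available_energy_kwh)


def joint_loss(probs, target_index, requested, available, weight=0.5):
    """Task loss plus a weighted physics penalty (weight is a hypothetical value)."""
    task = cross_entropy(probs, target_index)
    return task + weight * physics_penalty(requested, available)


# Hypothetical answer distribution over three candidate driving decisions.
probs = [0.7, 0.2, 0.1]

# Same prediction, scored once within the energy budget and once over it.
feasible = joint_loss(probs, 0, requested=1.0, available=2.0)
infeasible = joint_loss(probs, 0, requested=3.0, available=2.0)

print(f"feasible decision:   {feasible:.4f}")
print(f"infeasible decision: {infeasible:.4f}")
```

In this toy setup the two cases differ only by the penalty term, so a model trained on such a loss is pushed toward answers that are both correct and physically achievable. Whether EVLA's loss takes this form is not stated in the abstract.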

Comments:	17 pages
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2606.28938 [cs.CL]
	(or arXiv:2606.28938v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2606.28938 arXiv-issued DOI via DataCite

Submission history

From: Yuxin Liu
[v1] Sat, 27 Jun 2026 14:20:26 UTC (8,663 KB)

Originally published at arxiv.org
