NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation
Quick Answer
NAVI-Orbital, a software system deployed on a Low Earth Orbit spacecraft, ran the Gemma 3 vision-language model entirely onboard for autonomous Earth observation. To the authors' knowledge, this is the first in-orbit demonstration of its kind. In ground benchmarking, the system reached 88.16% accuracy on a curated 7,960-image AID benchmark.
Quick Take
NAVI-Orbital processes imagery onboard the spacecraft, classifying each captured scene and producing a text description of its content instead of relying on the full image being downlinked. This semantic compression reduces the downlink bandwidth required. The in-orbit captures were processed with no fine-tuning for the flight instrument.
Key Points
- NAVI-Orbital uses Gemma 3 for autonomous multi-modal inference onboard.
- Reached 88.16% accuracy in ground benchmarking on a curated 7,960-image AID benchmark.
- Is re-tasked through plain-English prompts in place of conventional command sequences, and answers operator follow-ups via natural-language dialogue.
- Demonstrated feasibility of running foundation models on satellite-class edge computers.
- Enables semantic compression of Earth observations to optimize bandwidth.
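As a rough illustration of what semantic compression can mean for bandwidth, the sketch below compares the size of a raw frame with the size of a short text summary of that frame. The frame dimensions, the label, the description, and the `compression_ratio` helper are all hypothetical and are not taken from the paper.

```python
import json


def compression_ratio(raw_image_bytes: int, summary: dict) -> float:
    """Return how many times smaller the encoded summary is than the raw image."""
    # Compact JSON encoding of the summary that would be downlinked instead of pixels.
    payload = json.dumps(summary, separators=(",", ":")).encode("utf-8")
    return raw_image_bytes / len(payload)


# Assumed values for illustration only: an uncompressed 8-bit RGB frame
# and a made-up scene summary.
raw_bytes = 4096 * 4096 * 3
summary = {
    "label": "Port",
    "description": "Harbor with docked vessels beside a container yard.",
}

ratio = compression_ratio(raw_bytes, summary)
print(f"Summary is roughly {ratio:,.0f}x smaller than the raw frame")
```

The ratio depends entirely on the assumed frame size and summary length, so it should be read as an order-of-magnitude intuition rather than a measured result.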
Article Content
From the source RSS summary (arXiv:2606.18271v1).
Abstract: As Earth Observation data generation outpaces downlink bandwidth and human-in-the-loop processing, a widening gap has emerged between onboard collection and actionable ground intelligence. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft.
On April 16, 2026, NAVI-Orbital achieved what is, to the authors' knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. NAVI-Orbital uses a local vision-language model (Gemma 3) to classify each captured scene, produce a text description of its content and the relationships between its features, and respond to operator follow-up via natural-language dialogue.
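One common way to do zero-shot scene classification with a vision-language model is to list the candidate classes in the prompt and map the free-text reply back to a known label. The sketch below shows that pattern under stated assumptions: the prompt wording, the label list, and the `model` callable are hypothetical stand-ins, not the paper's actual prompts or its Gemma 3 interface.

```python
from typing import Callable, Optional, Sequence


def build_prompt(labels: Sequence[str]) -> str:
    """Build a plain-English classification prompt listing the candidate classes."""
    options = ", ".join(labels)
    return (
        "Classify this satellite image into exactly one of: "
        f"{options}. Answer with the label only."
    )


def parse_label(reply: str, labels: Sequence[str]) -> Optional[str]:
    """Map a free-text model reply to a known label, or None if nothing matches."""
    text = reply.strip().lower()
    # Check longer labels first so "Airport" is not mistaken for "Port".
    for label in sorted(labels, key=len, reverse=True):
        if label.lower() in text:
            return label
    return None


def classify(
    image: bytes,
    labels: Sequence[str],
    model: Callable[[bytes, str], str],
) -> Optional[str]:
    """Run one zero-shot classification using any (image, prompt) -> text model."""
    return parse_label(model(image, build_prompt(labels)), labels)


# Usage with a stub in place of a real vision-language model.
def fake_model(image: bytes, prompt: str) -> str:
    return "This looks like an Airport."


result = classify(b"", ["Port", "Airport", "Forest"], fake_model)
print(result)
```

Passing the model in as a callable keeps the prompt and parsing logic testable on the ground without the flight hardware or the model weights.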
The system is re-tasked through plain-English prompts in place of conventional command sequences, and is orchestrated by a graph-based state machine (LangGraph) coordinating dedicated agents for detection and dialogue. Results across ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live in-orbit captures of newly acquired, previously unseen Earth imagery (including uncorrected YAM-9 imagery, processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument) demonstrate the feasibility of running foundation models on satellite-class edge computers to invert the conventional acquire-then-downlink-everything bandwidth profile through semantic compression of Earth observations in-orbit.
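The idea of a graph-based state machine coordinating a detection agent and a dialogue agent can be sketched in plain Python. This is a minimal illustration that deliberately does not use the LangGraph API. The node names, the state keys, and the routing rule are assumptions for the example and are not described in the abstract.

```python
from typing import Callable, Dict

State = Dict[str, object]
# Each node updates the shared state and returns the name of the next node.
Node = Callable[[State], str]


def detect(state: State) -> str:
    """Stand-in for the detection agent: records a scene label."""
    state["label"] = "Port"  # hypothetical classification result
    # Route to dialogue only if the operator asked a follow-up question.
    return "dialogue" if state.get("question") else "end"


def dialogue(state: State) -> str:
    """Stand-in for the dialogue agent: answers using the detection result."""
    state["answer"] = f"The scene was classified as {state['label']}."
    return "end"


GRAPH: Dict[str, Node] = {"detect": detect, "dialogue": dialogue}


def run(state: State, start: str = "detect", max_steps: int = 10) -> State:
    """Walk the graph from `start` until a node returns "end"."""
    node = start
    steps = 0
    while node != "end":
        if steps >= max_steps:
            raise RuntimeError("graph did not terminate")
        node = GRAPH[node](state)
        steps += 1
    return state


with_question = run({"question": "What is in this scene?"})
without_question = run({})
print(with_question["answer"])
```

Keeping the routing decision inside each node's return value makes the control flow explicit, which is the property a graph-based orchestrator provides. A real deployment would replace the stubs with model-backed agents.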