EO-Agents: A Three-Agent LLM Pipeline for Earth Observation Hypothesis Generation

arXiv cs.AI·Mahyar Ghazanfari, Amin Tabrizian, Armin Mehrabian, Peng Wei


7/3/2026

Quick Answer

The EO-Agents pipeline uses a three-agent LLM system to generate scientifically grounded hypotheses from NASA's Earth Observation Knowledge Graph, producing 160 hypotheses across multiple Earth science domains.

Quick Take

EO-Agents grounds hypothesis generation in NASA's Earth Observation Knowledge Graph: a heterogeneous graph neural network ranks candidate dataset pairings, and a three-agent LLM pipeline filters, generates, and evaluates hypotheses, yielding 160 across multiple Earth science domains. A 2×2×2 factorial experiment with GPT-5.2 and Claude Sonnet 4.6 finds that hypothesis rankings stay stable, while absolute scores depend strongly on which model acts as judge.

Key Points

The pipeline ranks dataset pairings using a heterogeneous graph neural network.
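The 160 generated hypotheses span ecohydrology, glaciology, aerosol-cloud interactions, vegetation phenology, and stratospheric chemistry.
Model-predicted novel dataset pairings are rated nearly as plausible as held-out real co-usages from the literature.
Hypothesis rankings remain stable across GPT-5.2 and Claude Sonnet 4.6.
Absolute scores depend strongly on judge identity, which exposes the limits of single-judge LLM evaluation.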


Article Content

From source RSS / original summary

arXiv:2607.01584v1, Announce Type: new

Abstract: Large language models have recently been explored for scientific hypothesis generation, but most prior work relies on unstructured literature and free-form textual claims. We present a pipeline for Earth observation that grounds hypothesis generation directly in the NASA Earth Observation Knowledge Graph.

A heterogeneous graph neural network trained on historical co-usage relations ranks candidate dataset pairings, and a three-agent LLM pipeline filters, generates, and evaluates structured research hypotheses. Applied to 1,475 NASA datasets, the system produces 160 hypotheses spanning multiple Earth-science domains, including ecohydrology, glaciology, aerosol-cloud interactions, vegetation phenology, and stratospheric chemistry.
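The rank, filter, generate, evaluate flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function `run_pipeline`, the three stub agents, the dataset names, and the link scores are all hypothetical, and in the paper each agent role is played by an LLM while the scores come from the trained graph neural network.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (dataset_a, dataset_b)


@dataclass
class Hypothesis:
    pair: Pair
    text: str
    score: float


def run_pipeline(
    scored_pairs: List[Tuple[Pair, float]],
    top_k: int,
    filter_agent: Callable[[Pair], bool],
    generator_agent: Callable[[Pair], str],
    evaluator_agent: Callable[[str], float],
) -> List[Hypothesis]:
    """Rank pairs by link score, then filter, generate, and evaluate."""
    # Keep the top_k pairs by (stand-in) link-prediction score.
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:top_k]
    hypotheses = []
    for pair, _link_score in ranked:
        if not filter_agent(pair):  # agent 1: drop unsuitable pairings
            continue
        text = generator_agent(pair)  # agent 2: draft a hypothesis
        hypotheses.append(Hypothesis(pair, text, evaluator_agent(text)))  # agent 3
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


# Hypothetical stand-ins for the three LLM agents (assumptions, not the paper's prompts).
def toy_filter(pair: Pair) -> bool:
    return pair[0] != pair[1]  # reject a dataset paired with itself


def toy_generator(pair: Pair) -> str:
    return f"Joint analysis of {pair[0]} and {pair[1]} may reveal a coupled signal."


def toy_evaluator(text: str) -> float:
    return float(len(text))  # placeholder score, not a real quality measure


# Made-up dataset names and scores, for illustration only.
toy_pairs = [
    (("soil_moisture", "ndvi"), 0.91),
    (("aerosol_depth", "aerosol_depth"), 0.88),
    (("ice_velocity", "sea_surface_temp"), 0.75),
    (("ozone", "cloud_fraction"), 0.40),
]

results = run_pipeline(toy_pairs, 3, toy_filter, toy_generator, toy_evaluator)
for h in results:
    print(h.pair, h.score)
```

Passing the agents in as plain callables keeps the ranking stage separate from the LLM stages, so either side can be swapped without touching the other.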

Model-predicted novel dataset pairings are rated nearly as plausible as held-out real co-usages from the literature, indicating that the pipeline surfaces scientifically coherent yet unexplored combinations. A 2×2×2 factorial experiment across GPT-5.2 and Claude Sonnet 4.6 shows that hypothesis rankings remain stable, while absolute scores depend strongly on judge identity, highlighting limitations of single-judge LLM evaluation.
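The distinction between stable rankings and judge-dependent absolute scores can be illustrated with a small rank-correlation check. The sketch below uses Spearman's rank correlation as one plausible stability measure; the summary does not say which statistic the paper uses, and the scores are made-up toy values, not results from the paper.

```python
from statistics import mean
from typing import List


def ranks(scores: List[float]) -> List[int]:
    """Return 1-based ranks (1 = lowest score). Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman rank correlation via the no-ties formula."""
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Toy scores for the same five hypotheses from two hypothetical judges.
# Judge B is systematically stricter but orders the hypotheses the same way.
judge_a = [7.0, 5.5, 8.2, 6.1, 4.0]
judge_b = [4.1, 3.0, 5.0, 3.6, 2.2]

rho = spearman(judge_a, judge_b)
mean_gap = mean(judge_a) - mean(judge_b)
print(f"rank correlation: {rho:.2f}, mean score gap: {mean_gap:.2f}")
```

In this toy case the two judges agree perfectly on the ordering while disagreeing by several points on the absolute scale, which is the pattern that makes a single judge's raw scores hard to interpret on their own.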

