EO-Agents: A Three-Agent LLM Pipeline for Earth Observation Hypothesis Generation
Quick Answer
The EO-Agents pipeline uses a three-agent LLM system to generate scientifically grounded hypotheses from NASA's Earth Observation Knowledge Graph, producing 160 hypotheses across several Earth science domains.
Quick Take
A heterogeneous graph neural network ranks candidate dataset pairings from NASA's Earth Observation Knowledge Graph, and a three-agent LLM pipeline filters, generates, and evaluates hypotheses from them. A 2×2×2 factorial experiment across GPT-5.2 and Claude Sonnet 4.6 shows that hypothesis rankings remain stable, while absolute scores depend strongly on which model acts as judge.
Key Points
- The pipeline ranks dataset pairings using a heterogeneous graph neural network.
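- The 160 generated hypotheses span ecohydrology, glaciology, aerosol-cloud interactions, vegetation phenology, and stratospheric chemistry.
- Model-predicted novel dataset pairings are rated nearly as plausible as held-out real co-usages from the literature.
- Hypothesis rankings remain stable across GPT-5.2 and Claude Sonnet 4.6.
- Absolute scores depend strongly on judge identity, which exposes the limits of single-judge LLM evaluation.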
Article Content
From source RSS / original summary (arXiv:2607.01584v1)
Abstract: Large language models have recently been explored for scientific hypothesis generation, but most prior work relies on unstructured literature and free-form textual claims. We present a pipeline for Earth observation that grounds hypothesis generation directly in the NASA Earth Observation Knowledge Graph.
A heterogeneous graph neural network trained on historical co-usage relations ranks candidate dataset pairings, and a three-agent LLM pipeline filters, generates, and evaluates structured research hypotheses. Applied to 1,475 NASA datasets, the system produces 160 hypotheses spanning multiple Earth-science domains, including ecohydrology, glaciology, aerosol-cloud interactions, vegetation phenology, and stratospheric chemistry.
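The rank, filter, generate, evaluate flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three agents are stubbed as plain Python functions rather than LLM calls, the link scores stand in for the graph neural network's output, and every name here (`run_pipeline`, `filter_agent`, `generator_agent`, `evaluator_agent`, and the dataset labels) is a hypothetical placeholder rather than a real NASA dataset identifier or an API from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


@dataclass
class Hypothesis:
    pair: Pair
    statement: str
    score: float


def run_pipeline(
    ranked_pairs: List[Tuple[Pair, float]],
    filter_agent: Callable[[str, str], bool],
    generator_agent: Callable[[str, str], str],
    evaluator_agent: Callable[[str], float],
    top_k: int,
) -> List[Hypothesis]:
    """Take the top_k pairs by link score, then filter, generate, and evaluate."""
    best = sorted(ranked_pairs, key=lambda item: item[1], reverse=True)[:top_k]
    hypotheses = []
    for (a, b), _link_score in best:
        if not filter_agent(a, b):  # agent 1: drop implausible pairings
            continue
        statement = generator_agent(a, b)  # agent 2: draft a hypothesis
        score = evaluator_agent(statement)  # agent 3: score the hypothesis
        hypotheses.append(Hypothesis((a, b), statement, score))
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


# Hypothetical stand-ins for the LLM agents (assumptions, not the paper's prompts).
def filter_agent(a: str, b: str) -> bool:
    return a != b  # toy rule: reject a dataset paired with itself


def generator_agent(a: str, b: str) -> str:
    return f"Changes in {a} are associated with changes in {b}."


def evaluator_agent(statement: str) -> float:
    return min(1.0, len(statement) / 100)  # toy score, not a real quality metric


# Made-up link scores standing in for the graph neural network's ranking.
ranked_pairs = [
    (("soil_moisture", "vegetation_index"), 0.91),
    (("ice_velocity", "surface_temperature"), 0.84),
    (("aerosol_depth", "aerosol_depth"), 0.80),
    (("ozone_profile", "soil_moisture"), 0.35),
]

results = run_pipeline(
    ranked_pairs, filter_agent, generator_agent, evaluator_agent, top_k=3
)
for h in results:
    print(h.pair, round(h.score, 2), h.statement)
```

Keeping the agents as interchangeable callables is one way to swap the underlying model per role without touching the control flow.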
Model-predicted novel dataset pairings are rated nearly as plausible as held-out real co-usages from the literature, indicating that the pipeline surfaces scientifically coherent yet unexplored combinations. A 2×2×2 factorial experiment across GPT-5.2 and Claude Sonnet 4.6 shows that hypothesis rankings remain stable, while absolute scores depend strongly on judge identity, highlighting limitations of single-judge LLM evaluation.
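The difference between stable rankings and judge-dependent absolute scores can be made concrete with a small calculation. The sketch below compares two hypothetical judges on five hypotheses using Spearman rank correlation and the gap between their mean scores. The scores are invented for illustration and are not results from the paper, and the abstract does not say which agreement statistic the authors used, so Spearman correlation is an assumption here.

```python
from typing import List


def ranks(scores: List[float]) -> List[int]:
    """Rank scores from 1 (highest) to n (lowest); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Invented scores for the same five hypotheses from two hypothetical judges.
judge_a = [7.5, 6.0, 8.2, 5.1, 6.8]
judge_b = [5.0, 3.9, 6.1, 2.8, 5.2]

rho = spearman(judge_a, judge_b)
mean_gap = mean(judge_a) - mean(judge_b)

print(f"rank agreement (Spearman): {rho:.2f}")
print(f"mean score gap between judges: {mean_gap:.2f}")
```

In this toy data the two judges order the hypotheses almost identically while one scores consistently lower, which is the pattern that makes a single judge's absolute scores hard to interpret on their own.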