Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval
Quick Answer
The system automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval.
Quick Take
In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system outperforms retrieval-only and decomposition-based baselines across four QA benchmarks. Ablations show that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.
Key Points
- Automatically discovers executable schemas from raw multi-source data.
- Improves over retrieval-only and decomposition-based baselines across four QA benchmarks.
- Uses the discovered schema to condition retrieval routing at query time.
- Applies deterministic structural analysis to infer identity keys, foreign keys, and source hierarchy.
- Routes retrieval through a multi-tool agent across structured lookup, graph traversal, and vector search.
Article Content
From source RSS / original summary
arXiv:2606.05415v1 Announce Type: new
Abstract: Real-world data spans tables, documents, and semi-structured files with implicit semantics. Querying this data requires integrating evidence across inconsistent schemas and formats, yet existing approaches either demand costly manual engineering or bypass structure entirely. We present a system that automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval.
A closed-world field catalog constrains LLM-based schema discovery to attested fields; deterministic structural analysis infers identity keys, foreign keys, and source hierarchy; and the resulting schema drives extraction, deduplication, and cross-source linking into a provenance-aware knowledge graph.
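The abstract does not say how the field catalog or the structural analysis is implemented. The sketch below is one plausible reading under stated assumptions: the catalog is the set of (source, field) pairs that actually occur in the data, an identity key is a column that is non-null and unique in every row, and a foreign key is a column whose values are all contained in another source's identity key. All function names and the sample data are hypothetical, not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Table = List[Dict[str, object]]

# Hypothetical sample data: two sources with an implicit link between them.
SOURCES: Dict[str, Table] = {
    "customers": [
        {"customer_id": 1, "name": "Ada"},
        {"customer_id": 2, "name": "Lin"},
        {"customer_id": 3, "name": "Ada"},
    ],
    "orders": [
        {"order_id": 10, "customer_id": 1, "total": 5.0},
        {"order_id": 11, "customer_id": 1, "total": 7.5},
        {"order_id": 12, "customer_id": 2, "total": 5.0},
    ],
}


def build_field_catalog(sources: Dict[str, Table]) -> Set[Tuple[str, str]]:
    """Collect every (source, field) pair that actually appears in the data."""
    return {(name, f) for name, rows in sources.items() for row in rows for f in row}


def filter_to_attested(
    proposed: List[Tuple[str, str]], catalog: Set[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Closed-world check: drop any proposed field that is not in the catalog."""
    return [f for f in proposed if f in catalog]


def infer_identity_keys(sources: Dict[str, Table]) -> Dict[str, List[str]]:
    """Assumed rule: a column is an identity key if it is non-null and unique."""
    keys: Dict[str, List[str]] = {}
    for name, rows in sources.items():
        fields = sorted({f for row in rows for f in row})
        keys[name] = [
            f
            for f in fields
            if all(row.get(f) is not None for row in rows)
            and len({row[f] for row in rows}) == len(rows)
        ]
    return keys


def infer_foreign_keys(
    sources: Dict[str, Table], identity_keys: Dict[str, List[str]]
) -> List[Tuple[str, str, str, str]]:
    """Assumed rule: a column is a foreign key if all of its non-null values
    appear in an identity key of a different source (value containment)."""
    found = []
    for child, rows in sources.items():
        for col in sorted({f for row in rows for f in row}):
            values = {row[col] for row in rows if row.get(col) is not None}
            if not values:
                continue
            for parent, keys in sorted(identity_keys.items()):
                if parent == child:
                    continue
                for key in keys:
                    if values <= {row[key] for row in sources[parent]}:
                        found.append((child, col, parent, key))
    return found
```

Because every step is a set operation over the observed values, the same input always yields the same keys, which is the sense of "deterministic" assumed here. On small samples, value containment can produce spurious foreign keys, so a real system would likely need additional evidence such as name similarity or type checks.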
At query time the schema -- optionally extended via a monotonic protocol -- conditions a multi-tool agent routing retrieval across structured lookup, graph traversal, and vector search, returning grounded answers with traceable citations.
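To make schema-conditioned routing concrete, here is a minimal sketch. It rests on two assumptions that the abstract does not confirm: "monotonic" is read as additive-only (an extension may add fields or entities but never removes or changes existing ones), and the paper's LLM-driven agent is replaced by a simple rule-based stand-in. The `Schema` class, `extend_monotonic`, `route`, and the tool names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set


@dataclass
class Schema:
    entities: Dict[str, Set[str]]   # entity name -> known field names
    relations: Set[FrozenSet[str]]  # unordered pairs of linked entities


def extend_monotonic(schema: Schema, new_fields: Dict[str, Iterable[str]]) -> Schema:
    """Return a new schema with fields added; nothing is removed or altered.

    Assumption: this additive-only rule is our reading of "monotonic".
    """
    entities = {name: set(fields) for name, fields in schema.entities.items()}
    for name, fields in new_fields.items():
        entities.setdefault(name, set()).update(fields)
    return Schema(entities=entities, relations=set(schema.relations))


def route(schema: Schema, entities: List[str], fields: List[str]) -> str:
    """Rule-based stand-in for the agent: choose a retrieval tool from how
    the query's entities and fields map onto the schema."""
    known = [e for e in entities if e in schema.entities]
    # Two or more linked entities: follow the relation in the graph.
    if len(known) >= 2:
        linked = any(
            frozenset((a, b)) in schema.relations
            for i, a in enumerate(known)
            for b in known[i + 1:]
        )
        if linked:
            return "graph_traversal"
    # One entity whose requested fields are all in the schema: direct lookup.
    if len(known) == 1 and fields and set(fields) <= schema.entities[known[0]]:
        return "structured_lookup"
    # Anything the schema cannot account for falls back to vector search.
    return "vector_search"


# Hypothetical example schema.
BASE = Schema(
    entities={
        "customers": {"customer_id", "name"},
        "orders": {"order_id", "customer_id"},
    },
    relations={frozenset(("orders", "customers"))},
)
```

In this sketch, a query about `orders.status` falls back to vector search under `BASE`, but becomes a structured lookup after `extend_monotonic(BASE, {"orders": ["status"]})`. Since extensions only add, any query that was routable before an extension remains routable after it.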
In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system improves over retrieval-only and decomposition-based baselines across four QA benchmarks, with ablations showing that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.