Making Failure Safe: A Constrained, Verifiable Agent Framework for Open-Web Data Collection


7/2/2026

Quick Answer

The proposed constrained, verifiable agent framework has the LLM emit typed JSON collector configurations instead of free-form scraper code. On 80 independently source-verified tasks it runs with zero execution-stage LLM tokens and the lowest average wall-clock time, trading moderate one-shot quality for a reusable, deterministic, and verifiable execution path suited to repeated scheduled collection.

Key Points

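Uses a six-type collector taxonomy to shift LLM output from free-form code to typed JSON collector configurations.
Runs with zero execution-stage LLM tokens on 80 independently source-verified tasks.
Records the lowest average wall-clock time on those 80 tasks, at the cost of moderate one-shot quality.
Combines static Airflow DAG execution with rule-based quality checking and structured feedback correction.
On 138 tasks, the taxonomy supports description-based requirement typing, but stable instantiation also requires completing source, field, and execution constraints.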

Article Content

From source RSS / original summary

arXiv:2607.00035v1. Abstract: LLMs and agents can generate web scrapers from natural-language requirements, but direct generation remains unreliable because of dependency errors, broken selectors, schema mismatches, and heterogeneous page structures.

We propose a constrained, verifiable agent framework that shifts LLM output from free-form code to typed JSON collector configurations, combining a six-type collector taxonomy, template and utility-function constraints, static Airflow DAG execution, rule-based quality checking, and structured feedback correction.
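
As a rough illustration of what "typed JSON collector configurations" could mean in practice, the sketch below parses an LLM-emitted JSON string and rejects anything outside a fixed schema. It is not the paper's schema: the abstract does not name the six collector types or the configuration keys, so `CollectorType`, its members, and the keys `collector_type`, `source_url`, and `fields` are all hypothetical.

```python
import json
from dataclasses import dataclass
from enum import Enum


class CollectorType(Enum):
    # Placeholder members: the abstract says there are six collector types
    # but does not list them, so these names are purely illustrative.
    TYPE_A = "type_a"
    TYPE_B = "type_b"


@dataclass(frozen=True)
class CollectorConfig:
    collector_type: CollectorType
    source_url: str
    fields: tuple


ALLOWED_KEYS = {"collector_type", "source_url", "fields"}  # hypothetical schema


def parse_config(raw: str) -> CollectorConfig:
    """Parse LLM-emitted JSON and reject anything outside the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("config must be a JSON object")
    unknown = set(data) - ALLOWED_KEYS
    missing = ALLOWED_KEYS - set(data)
    if unknown or missing:
        raise ValueError(f"unknown keys {sorted(unknown)}, missing keys {sorted(missing)}")
    try:
        ctype = CollectorType(data["collector_type"])
    except ValueError:
        raise ValueError(f"unsupported collector_type: {data['collector_type']!r}") from None
    url = data["source_url"]
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        raise ValueError("source_url must be an http(s) URL")
    fields = data["fields"]
    if not isinstance(fields, list) or not fields or not all(isinstance(f, str) for f in fields):
        raise ValueError("fields must be a non-empty list of strings")
    return CollectorConfig(ctype, url, tuple(fields))


config = parse_config(
    '{"collector_type": "type_a", "source_url": "https://example.com/list", "fields": ["title", "date"]}'
)
```

The point of the design is that the LLM can only select among predefined options and fill in values. A configuration that names an unknown collector type or adds extra keys fails at parse time, before any scheduled run.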

Experiments on 138 tasks show that the taxonomy supports description-based requirement typing, while confirming that stable instantiation requires completing source, field, and execution constraints beyond the initial description. On 80 independently source-verified tasks, the framework runs with zero execution-stage LLM tokens and the lowest average wall-clock time, trading moderate one-shot quality for a reusable, deterministic, and verifiable execution path suited to repeated scheduled collection.
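
The observation that a description alone is not enough for stable instantiation can be pictured as a rule-based completeness check that returns structured feedback. The following is a minimal sketch under assumptions: the three constraint groups come from the abstract, but the `REQUIRED` keys, the `Feedback` record, and `check_constraints` are hypothetical names, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    constraint: str  # which constraint group the issue belongs to
    message: str     # what a correction step would need to fix


# Hypothetical required keys for each constraint group named in the abstract.
REQUIRED = {
    "source": ("url",),
    "field": ("names",),
    "execution": ("schedule",),
}


def check_constraints(config: dict) -> list:
    """Return one Feedback item per missing or empty constraint."""
    issues = []
    for group, keys in REQUIRED.items():
        section = config.get(group)
        if not isinstance(section, dict):
            issues.append(Feedback(group, f"missing '{group}' section"))
            continue
        for key in keys:
            if not section.get(key):
                issues.append(Feedback(group, f"'{group}.{key}' is empty or missing"))
    return issues


# A config built from the description alone leaves all three groups open.
description_only = {"description": "collect daily headlines"}
complete = {
    "description": "collect daily headlines",
    "source": {"url": "https://example.com/news"},
    "field": {"names": ["title", "date"]},
    "execution": {"schedule": "@daily"},
}
open_issues = check_constraints(description_only)
resolved_issues = check_constraints(complete)
```

Because the check is pure rules over the configuration, it uses no LLM calls. Only the feedback items would be handed back to a model for correction.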

These results position the framework as a reusable, low-cost, and verifiable execution path for repeated open-web data collection.
