Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning
Quick Answer
IRTS-ToolBench is a benchmark of 1,700 questions spanning 10 task types across 13 domains for evaluating LLMs on irregular time series question answering (TSQA). It targets a gap in existing TSQA benchmarks, which mostly assume regularly sampled inputs.
Quick Take
IRTS-ToolBench aims to show how LLMs and AI agents perform under real-world conditions, where observations are asynchronous, missing values are informative rather than random, and sampling frequencies vary across sensors and operational windows.
Key Points
- IRTS-ToolBench contains 1,700 questions spanning 10 task types across 13 domains for evaluating irregular TSQA.
- Existing TSQA benchmarks mostly assume regularly sampled inputs, leaving irregular sampling largely untested.
- The benchmark is designed to be used independently by any researcher working on LLM-based irregular time series analysis.
- Standardized inputs and reproducible evaluation protocols are provided for consistency.
- Code for IRTS-ToolBench is available on GitHub.
Article Excerpt
From source RSS / original summary (arXiv:2606.15107v1)
Abstract: Time series data in real-world deployments is overwhelmingly irregular. Observations are asynchronous, missing values are informative rather than random, and sampling frequencies vary across sensors and operational windows. However, existing Time Series Question Answering (TSQA) benchmarks mostly assume regularly sampled inputs, leaving a fundamental gap in understanding how large language models (LLMs) and AI agents perform under irregular conditions.
To bridge this gap, we introduce IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains. IRTS-ToolBench is designed to be used independently by any researcher working on LLM-based irregular time series analysis, providing standardized inputs and a reproducible evaluation protocol. Code can be found in https://github.com/SanhornC/IRTS-ToolBench.
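To make the irregular-sampling problem in the abstract concrete, here is a minimal Python sketch of why assuming regular sampling can mislead. It is not taken from IRTS-ToolBench: the readings, the function names, and the zero-order-hold weighting are all illustrative assumptions. It compares a plain average over samples with an average weighted by how long each value was in effect.

```python
from typing import List, Tuple

# Hypothetical (timestamp_seconds, value) pairs; not data from the benchmark.
Reading = Tuple[float, float]


def sampling_gaps(readings: List[Reading]) -> List[float]:
    """Return the time gap between each pair of consecutive readings."""
    return [later[0] - earlier[0] for earlier, later in zip(readings, readings[1:])]


def naive_mean(readings: List[Reading]) -> float:
    """Average the values as if every sample covered the same time span."""
    if not readings:
        raise ValueError("need at least one reading")
    return sum(value for _, value in readings) / len(readings)


def time_weighted_mean(readings: List[Reading]) -> float:
    """Average the values, holding each one until the next reading arrives.

    This zero-order hold is an assumption chosen for illustration; the
    final reading contributes no duration because nothing follows it.
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings")
    weighted_sum = 0.0
    total_duration = 0.0
    for (t0, v0), (t1, _) in zip(readings, readings[1:]):
        weighted_sum += v0 * (t1 - t0)
        total_duration += t1 - t0
    return weighted_sum / total_duration


# Three closely spaced samples, then one long gap.
readings: List[Reading] = [(0.0, 10.0), (1.0, 10.0), (2.0, 40.0), (10.0, 40.0)]

print(sampling_gaps(readings))       # [1.0, 1.0, 8.0]
print(naive_mean(readings))          # 25.0
print(time_weighted_mean(readings))  # 34.0
```

The two averages differ because the value 40.0 was in effect for 8 of the 10 seconds but accounts for only half of the samples. This is the kind of discrepancy that a question posed over irregular data has to account for.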