ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

arXiv cs.CL·Yanjun Lin, Zimo Xiao, Kartik Natarajan, Mahesh Sankaranarayanan, Niraj Nawanit, Rakshit Parashar, Austin Zhang, Karthik Konaraddi, Rishita Mote, Wei Niu

17h ago

·~2 min·5/20/2026·en·1

Quick Take

ReacTOD introduces a neuro-symbolic NLU architecture for improved zero-shot dialogue state tracking.

Key Points

Achieves 52.71% joint goal accuracy on MultiWOZ 2.1.
Implements a self-correcting ReAct loop for enhanced accuracy.
Cross-benchmark generalization without task-specific training data.

📖 Reader Mode

~2 min read

[Submitted on 18 May 2026]

View PDF HTML (experimental)

Abstract:Task-oriented dialogue systems -- handling transactions, reservations, and service requests -- require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09% -- demonstrating cross-benchmark generalization without task-specific training data.

Comments:	Accepted at TrustNLP Workshop at ACL 2026
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.19077 [cs.CL]
	(or arXiv:2605.19077v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.19077 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yanjun Lin [view email]
[v1] Mon, 18 May 2026 20:06:04 UTC (41 KB)

— Originally published at arxiv.org

Continue reading on arxiv.org

ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

Quick Take

Key Points

📖 Reader Mode

Submission history

More from arXiv cs.CL

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

Related in this space

From Prompts to Protocols: An AI Agent for Laboratory Automation

Agentic Trading: When LLM Agents Meet Financial Markets