Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
Quick Take
The proposed STHTD-MP method targets faster off-policy prediction by replacing the feature covariance metric used in Mirror-Prox TD methods such as GTD2-MP with a behavior-induced metric. The analysis shows that STHTD-MP can have a smaller mean contraction factor than GTD2-MP when this metric improves the saddle-point geometry, a condition supported by exact numerical mean-operator analysis on the two-state, Random Walk, and Boyan Chain benchmarks.
Key Points
- STHTD-MP replaces the feature covariance metric with the symmetric part of the behavior-policy Bellman matrix to obtain a more informative update geometry.
- The formal convergence analysis assumes a positive definite behavior-induced metric and a Hurwitz joint mean system, and establishes convergence by the ODE method.
- Exact numerical mean-operator analysis on two-state, Random Walk, and Boyan Chain benchmarks supports the comparison.
- Compared with GTD2-MP, STHTD-MP can have a smaller mean contraction factor when the behavior-induced metric improves the saddle-point geometry.
- Baird's counterexample is a singular boundary case in which the strict assumptions fail.
Article Content
From source RSS / original summary
arXiv:2605.28849v1 Announce Type: new
Abstract: Gradient temporal-difference methods provide stable off-policy prediction with linear function approximation, but their practical performance is strongly affected by the geometry induced by the auxiliary-variable metric. Existing Mirror-Prox TD methods typically use the feature covariance metric, whereas hybrid TD methods suggest that behavior-policy transition information can provide a more informative update geometry.
This paper proposes a behavior-induced Mirror-Prox temporal-difference method, called STHTD-MP, which replaces the covariance metric in the primal-dual saddle-point formulation with the symmetric part of the behavior-policy Bellman matrix. The method keeps a single learning rate for the primal and auxiliary variables and applies a Mirror-Prox prediction-correction step to the resulting hybrid saddle-point operator.
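The construction described above can be illustrated with a minimal sketch on the mean (noise-free) system. It is not the paper's implementation. It assumes the standard GTD2-style saddle-point operator with the covariance metric swapped for the symmetric part of the behavior-policy Bellman matrix, and a Euclidean Mirror-Prox (extragradient) step. The 3-state chain, the features, the rewards, and the names `stationary`, `F`, and `mirror_prox_step` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 3-state problem with 2 features; all numbers are illustrative.
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [0.5, 1.0],
                [0.0, 1.0]])              # feature matrix, one row per state
P_mu = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])        # behavior-policy transitions
P_pi = np.array([[0.2, 0.8, 0.0],
                 [0.1, 0.3, 0.6],
                 [0.0, 0.3, 0.7]])        # target-policy transitions
r = np.array([0.0, 0.0, 1.0])             # expected reward per state


def stationary(P, iters=500):
    """Stationary distribution of an aperiodic chain by power iteration."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = d @ P
    return d


D = np.diag(stationary(P_mu))             # states are sampled under the behavior policy
I = np.eye(3)
A_pi = Phi.T @ D @ (I - gamma * P_pi) @ Phi   # off-policy Bellman matrix
A_mu = Phi.T @ D @ (I - gamma * P_mu) @ Phi   # behavior-policy Bellman matrix
b = Phi.T @ D @ r

# Behavior-induced metric: symmetric part of the behavior-policy Bellman matrix.
M = 0.5 * (A_mu + A_mu.T)


def F(z):
    """Assumed mean saddle-point operator for z = (theta, y)."""
    theta, y = z[:2], z[2:]
    return np.concatenate([-A_pi.T @ y, A_pi @ theta + M @ y - b])


def mirror_prox_step(z, alpha):
    """Prediction step, then correction step, with one shared step size."""
    z_half = z - alpha * F(z)             # prediction
    return z - alpha * F(z_half)          # correction uses the operator at z_half


# Linear part of F, used here only to pick a conservative step size.
G = np.block([[np.zeros((2, 2)), -A_pi.T], [A_pi, M]])
alpha = 0.5 / np.linalg.norm(G, 2)

z = np.zeros(4)
for _ in range(2000):
    z = mirror_prox_step(z, alpha)

print("theta estimate:", z[:2])
print("auxiliary variable:", z[2:])
```

In this sketch the same `alpha` scales both the primal block `theta` and the auxiliary block `y`, mirroring the single learning rate mentioned in the abstract. A stochastic version would replace `F` with sampled estimates, which is outside the scope of this example.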
We provide a formal convergence analysis for fixed-policy linear prediction under standard stochastic approximation assumptions: the behavior-induced metric is positive definite, the joint mean system is Hurwitz, boundedness follows from a Lyapunov argument, and the stochastic recursion converges by the ODE method. We further derive projected-oracle ergodic gap bounds and an exact mean-operator comparison with GTD2-MP based on the spectral radius of the deterministic Mirror-Prox error matrix.
The analysis shows that STHTD-MP can have a smaller mean contraction factor than GTD2-MP when the behavior-induced metric improves the saddle-point geometry. Exact numerical mean-operator analysis on two-state, Random Walk, and Boyan Chain benchmarks supports this condition, while Baird's counterexample is identified as a singular boundary case where the strict assumptions fail.
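The kind of mean-operator comparison described above can be sketched as follows. This is an assumption-laden illustration, not a reproduction of the paper's experiments. It assumes the deterministic Mirror-Prox error matrix of a linear operator with matrix G and step size alpha is I - alpha*G + alpha^2*G^2 (which follows from applying a Euclidean prediction-correction step to a linear operator), and that the two methods differ only in the metric block. The 3-state problem and all function names are hypothetical, and it is not one of the benchmarks named in the abstract.

```python
import numpy as np


def stationary(P, iters=500):
    """Stationary distribution of an aperiodic chain by power iteration."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        d = d @ P
    return d


def saddle_matrix(A, M):
    """Linear part of the assumed saddle-point operator with metric M."""
    k = A.shape[0]
    return np.block([[np.zeros((k, k)), -A.T], [A, M]])


def mean_contraction_factor(G, alpha):
    """Spectral radius of the Mirror-Prox error matrix I - aG + a^2 G^2."""
    E = np.eye(G.shape[0]) - alpha * G + alpha**2 * (G @ G)
    return np.max(np.abs(np.linalg.eigvals(E)))


# Hypothetical 3-state problem with 2 features; all numbers are illustrative.
gamma = 0.9
Phi = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]])
P_mu = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
P_pi = np.array([[0.2, 0.8, 0.0], [0.1, 0.3, 0.6], [0.0, 0.3, 0.7]])

D = np.diag(stationary(P_mu))
I = np.eye(3)
A_pi = Phi.T @ D @ (I - gamma * P_pi) @ Phi
A_mu = Phi.T @ D @ (I - gamma * P_mu) @ Phi

C = Phi.T @ D @ Phi                       # feature covariance metric (GTD2-MP style)
M = 0.5 * (A_mu + A_mu.T)                 # behavior-induced metric (STHTD-MP style)

G_cov = saddle_matrix(A_pi, C)
G_beh = saddle_matrix(A_pi, M)

# One shared, conservative step size so both error matrices are contractive.
alpha = 0.5 / max(np.linalg.norm(G_cov, 2), np.linalg.norm(G_beh, 2))

rho_gtd2 = mean_contraction_factor(G_cov, alpha)
rho_sthtd = mean_contraction_factor(G_beh, alpha)
print("covariance metric:      ", rho_gtd2)
print("behavior-induced metric:", rho_sthtd)
```

A smaller spectral radius means faster contraction of the mean error per iteration. Which metric wins depends on the problem and on the step size, so a fair comparison would sweep `alpha` for each method rather than fix a single value as done here.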