RusFinChain: A Russian Benchmark for Verifiable Chain-of-Thought Reasoning in Finance with Fuzzy-Aligned Evaluation
Quick Answer
RusFinChain is the first Russian-language benchmark for verifiable Chain-of-Thought reasoning in finance, featuring 5,280 examples across 17 domains.
Quick Take
An evaluation of 8 open-weight LLMs shows a Hard F1 of ~0.65 for step alignment, yet only ~29% of final answers are correct. Models often reproduce the expected intermediate steps without reaching the right result, which points to a substantial reasoning gap.
Key Points
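- RusFinChain comprises 5,280 parameterized examples generated from executable Python templates, spanning 17 domains and 172 topics.
- Two new metrics, Fuzzy Numeric Alignment and Soft-Attention Alignment, correlate more strongly with final-answer correctness (Spearman ρ ≈ 0.48) than the original ChainEval (ρ ≈ 0.38-0.46).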
- Models achieved ~0.65 Hard F1 for step alignment but only ~29% final answer accuracy.
- Dataset and evaluation framework released to support Russian-speaking financial AI development.
- The evaluation covered 8 open-weight LLMs on a stratified sample, yielding 8,100 responses.
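To make the idea of "parameterized examples from executable Python templates" concrete, here is a minimal sketch of what one such template could look like. It is an illustration only: the function name `compound_interest_template`, the parameter ranges, and the output fields are assumptions, not the RusFinChain schema, and real items would be phrased in Russian.

```python
import random


def compound_interest_template(seed: int) -> dict:
    """Hypothetical template: one seed yields one fully specified example."""
    rng = random.Random(seed)  # seeded, so the example is reproducible

    # Sample the parameters (ranges are illustrative assumptions).
    principal = rng.randrange(100_000, 1_000_001, 50_000)  # rubles
    rate_pct = rng.choice([6, 8, 10, 12])                  # annual rate, %
    years = rng.randint(2, 5)

    # Compute the gold reasoning chain from the sampled parameters.
    growth = (1 + rate_pct / 100) ** years
    final_amount = round(principal * growth, 2)
    interest = round(final_amount - principal, 2)

    question = (
        f"A deposit of {principal} rubles earns {rate_pct}% per year, "
        f"compounded annually. How much interest accrues over {years} years?"
    )
    steps = [
        {"desc": "growth factor", "value": round(growth, 6)},
        {"desc": "final amount", "value": final_amount},
        {"desc": "interest earned", "value": interest},
    ]
    return {"question": question, "steps": steps, "answer": interest}


example = compound_interest_template(seed=7)
print(example["question"])
for step in example["steps"]:
    print(step["desc"], step["value"])
```

Because the numbers are sampled at generation time, a model cannot have memorized a specific item, and every intermediate value in the chain can be checked automatically against the template's own computation.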
Article Content
From source RSS / original summary
arXiv:2607.01388v1 Announce Type: new
Abstract: Multi-step symbolic reasoning is essential for robust financial analysis, yet most benchmarks neglect intermediate reasoning steps. FINCHAIN introduced verifiable Chain-of-Thought (CoT) evaluation but is limited to English. FINESSE-Bench includes a Russian block but relies on multiple-choice questions without step-level supervision. We present RusFinChain, the first Russian-language symbolic benchmark for verifiable CoT reasoning in finance.
It spans 17 domains, 172 topics, and comprises 5,280 parameterized examples from executable Python templates, ensuring contamination-free evaluation. Each example includes a gold-standard reasoning chain with intermediate numeric values for automatic verification. We also introduce enhanced metrics: Fuzzy Numeric Alignment and Soft-Attention Alignment. We evaluate 8 open-weight LLMs on a stratified sample, generating 8,100 responses. Results reveal a substantial reasoning gap: models achieve Hard F1 of ~0.65 for step alignment, but only ~29% of final answers are correct. Our fuzzy and soft metrics show stronger correlation with final-answer correctness (Spearman ρ ≈ 0.48) than the original ChainEval (ρ ≈ 0.38-0.46), demonstrating superior diagnostic power. We release dataset, code, and evaluation framework to foster verifiable financial AI for the Russian-speaking community.
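The abstract does not define its metrics, so the following is only a rough sketch of how a hard versus fuzzy step-alignment F1 over intermediate numeric values might be computed. The function `step_f1`, the greedy one-to-one matching, and the relative-tolerance rule are assumptions made for illustration; they are not the paper's Fuzzy Numeric Alignment or the ChainEval definition.

```python
import math


def step_f1(pred: list[float], gold: list[float], rel_tol: float = 0.0) -> float:
    """F1 over step values with greedy one-to-one matching.

    rel_tol=0.0 demands exact equality ("hard" matching in this sketch);
    rel_tol>0 accepts values within that relative tolerance ("fuzzy").
    """
    unmatched = list(gold)  # gold values not yet claimed by a prediction
    hits = 0
    for p in pred:
        for i, g in enumerate(unmatched):
            if math.isclose(p, g, rel_tol=rel_tol, abs_tol=0.0):
                del unmatched[i]  # each gold step may be matched only once
                hits += 1
                break
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative values: the model rounded two of its three intermediate results.
gold_steps = [1.2544, 125440.0, 25440.0]
pred_steps = [1.25, 125440.0, 25400.0]

hard = step_f1(pred_steps, gold_steps)                # exact match only
fuzzy = step_f1(pred_steps, gold_steps, rel_tol=0.01)  # within 1%
print(f"hard F1 = {hard:.3f}, fuzzy F1 = {fuzzy:.3f}")
```

In this toy case the hard score penalizes rounding, while the fuzzy score credits steps that are numerically close. A tolerance-based score of this kind is one plausible way for a metric to track final-answer correctness more closely than exact matching.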