Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning
Quick Take
Aryabhata 2 is a language model post-trained with reinforcement learning for competitive STEM exams such as JEE and NEET. It outperforms its base model, GPT-OSS-20B, on competitive STEM reasoning while using up to 64% fewer output tokens. Trained on a curriculum built from PhysicsWallah's internal question banks, it targets multi-step reasoning and precise numerical computation in physics, chemistry, and mathematics.
Key Points
- Aryabhata 2 is GPT-OSS-20B post-trained through reinforcement learning with verifiable rewards.
- The training curriculum was built from PhysicsWallah's internal question banks.
- The model is evaluated on JEE Main, JEE Advanced, and NEET benchmarks.
- It outperforms its base model on competitive STEM reasoning while using up to 64% fewer output tokens.
- It is also evaluated on out-of-distribution reasoning datasets: AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA.
Article Content
arXiv:2605.28829v1. Abstract: Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving.
We introduce Aryabhata 2, a reasoning-focused language model for competitive STEM examinations, trained via reinforcement-learning post-training. Using PhysicsWallah's internal question banks, we construct a high-quality training curriculum and post-train GPT-OSS-20B through reinforcement learning with verifiable rewards. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes.
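The abstract does not spell out the reward function, the policy-gradient algorithm, or the rollout schedule, so the following is only a minimal sketch of how the three ingredients it names could fit together. Everything here is an assumption for illustration: the `Answer:` output convention, the exact-match check, the group-relative normalization (borrowed from GRPO-style methods, which the abstract does not name), and the step thresholds and group sizes in `rollout_group_size`.

```python
import re
from statistics import mean, pstdev


def verifiable_reward(response: str, reference: str) -> float:
    """Return 1.0 if the last 'Answer: ...' line equals the reference, else 0.0.

    Hypothetical convention: the model ends its solution with 'Answer: <value>'.
    """
    matches = re.findall(r"Answer:\s*(.+)", response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one rollout group (assumed GRPO-style scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def rollout_group_size(step: int, schedule=((0, 8), (1000, 16), (2000, 32))) -> int:
    """Pick the group size for a training step from a hypothetical schedule.

    Each (start_step, size) pair applies from start_step onward, so the
    group size grows as training progresses.
    """
    size = schedule[0][1]
    for start, group in schedule:
        if step >= start:
            size = group
    return size
```

A larger group means more sampled solutions per question, which raises the chance that at least one rollout is correct on a hard problem and therefore that the group yields a non-zero advantage.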
We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out-of-distribution reasoning datasets such as AIME, HMMT, MMLU-Pro, MMLU-Redux 2.0, and GPQA. Results show that Aryabhata 2 outperforms its base model GPT-OSS-20B on competitive STEM reasoning while requiring substantially fewer output tokens (up to 64% fewer).
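To make the "fewer output tokens" figure concrete, the short sketch below shows how such a percentage is typically computed from average response lengths. The function name and the token counts in the usage line are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from the paper.

```python
def token_reduction_pct(base_tokens: float, model_tokens: float) -> float:
    """Percentage by which model_tokens falls below base_tokens."""
    if base_tokens <= 0:
        raise ValueError("base_tokens must be positive")
    return 100.0 * (base_tokens - model_tokens) / base_tokens


# Hypothetical averages: 1000 tokens for the base model, 360 for the tuned one.
reduction = token_reduction_pct(1000, 360)
```

With these made-up numbers the reduction comes out at 64%, meaning the tuned model would emit roughly a third of the base model's tokens for the same question.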