Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

arXiv cs.CL·Ritvik Rastogi, Vishal Singh, Tejas Chaudhari, Sandeep Varma

5/29/2026


Quick Answer

Aryabhata 2 is a language model for competitive STEM exams such as JEE and NEET, built by post-training GPT-OSS-20B with reinforcement learning. It outperforms its base model on competitive STEM reasoning while using up to 64% fewer output tokens.

Quick Take

Aryabhata 2 targets the multi-step reasoning and numerical computation that exams like JEE and NEET demand. Post-trained from GPT-OSS-20B on a curriculum built from PhysicsWallah's internal question banks, it beats the base model on competitive-exam benchmarks with up to 64% fewer output tokens, and is also evaluated on out-of-distribution reasoning datasets.

Key Points

Aryabhata 2 is GPT-OSS-20B post-trained with reinforcement learning using verifiable rewards.
Its training curriculum was built from PhysicsWallah's internal question banks.
It outperforms its base model on the JEE Main, JEE Advanced, and NEET benchmarks.
It does so while requiring up to 64% fewer output tokens.
It was also evaluated on out-of-distribution reasoning datasets, including AIME, HMMT, and MMLU-Redux 2.0.


Article Content

From source RSS / original summary

arXiv:2605.28829v1. Abstract: Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptual understanding across physics, chemistry, and mathematics. Recent large language models perform strongly on common reasoning benchmarks, yet they remain difficult to deploy at scale, where millions of student doubts demand domain-specific, consistently structured problem solving.

We introduce Aryabhata 2, a reasoning-focused language model for competitive STEM examinations, trained via reinforcement-learning post-training. Using PhysicsWallah's internal question banks, we construct a high-quality training curriculum and post-train GPT-OSS-20B through reinforcement learning with verifiable rewards. Training combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes.
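The training recipe above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the reward definition, the group-size schedule, or the policy-gradient algorithm. The function names, the sizes 8/16/32, and the GRPO-style group normalization are hypothetical stand-ins, not the authors' implementation.

```python
import math
import statistics


def verifiable_reward(predicted: str, reference: str, rel_tol: float = 1e-6) -> float:
    """Binary reward: 1.0 if the final answer can be verified against the reference.

    Assumption: answers are either numeric strings or short labels such as "B".
    """
    try:
        matches = math.isclose(float(predicted), float(reference), rel_tol=rel_tol)
    except ValueError:
        # Non-numeric answers fall back to a case-insensitive string match.
        matches = predicted.strip().lower() == reference.strip().lower()
    return 1.0 if matches else 0.0


def rollout_group_size(step: int, total_steps: int, sizes=(8, 16, 32)) -> int:
    """Hypothetical schedule: equal-length stages with progressively larger groups."""
    stage = min(step * len(sizes) // total_steps, len(sizes) - 1)
    return sizes[stage]


def group_relative_advantages(rewards):
    """Center and scale rewards within one rollout group (GRPO-style assumption)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rollouts scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    rewards = [verifiable_reward(p, "3") for p in ["3.0", "2.5", "3", "four"]]
    print(rollout_group_size(step=150, total_steps=300))
    print(group_relative_advantages(rewards))
```

Under these assumptions, a larger group gives each question more sampled solutions, which makes it more likely that at least one rollout earns a nonzero reward on hard problems.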

We evaluate Aryabhata 2 on competitive examination benchmarks, including JEE Main, JEE Advanced, and NEET, as well as out-of-distribution reasoning datasets such as AIME, HMMT, and MMLU-Redux 2.0. Results show that Aryabhata 2 outperforms its base model GPT-OSS-20B on competitive STEM reasoning while requiring substantially fewer output tokens (up to 64% fewer).
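To make the token figure concrete, here is a small arithmetic sketch. The baseline of 10,000 output tokens is a hypothetical number chosen for illustration, not a result from the paper, and 64% is the reported upper bound rather than a typical saving.

```python
def tokens_after_reduction(baseline_tokens: int, reduction: float) -> int:
    """Output tokens remaining after cutting `reduction` (a fraction in [0, 1])."""
    return round(baseline_tokens * (1.0 - reduction))


# Hypothetical baseline: a 10,000-token answer at the reported maximum saving.
print(tokens_after_reduction(10_000, 0.64))
```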


