Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

arXiv cs.AI·Soichiro Nishimori, Shinri Okano, Keigo Habara, Sotetsu Koyamada, Eason Yu, Masashi Sugiyama

5/22/2026

Quick Answer

Mahjax is a GPU-accelerated Riichi Mahjong simulator written in JAX. It reaches up to 2 million steps per second on eight NVIDIA A100 GPUs under the no-red rules, and up to 1 million under the red rules.

Quick Take

Mahjax is a fully vectorized Riichi Mahjong environment in JAX that runs large batches of rollouts in parallel on GPUs. It targets reinforcement learning from scratch rather than pre-training on human play logs, and the authors report that agents trained in it improve their rank against baseline policies.

Key Points

Reaches up to 2 million steps per second under the no-red rules and up to 1 million under the red rules, both on eight NVIDIA A100 GPUs.
Fully vectorized environment allows large-scale rollout parallelization.
High-quality visualization tool aids in debugging and agent interaction.
Agents trained in the environment improve their rank against baseline policies.
Supports research on learning from scratch (tabula rasa) in a multi-player, imperfect-information, stochastic game.
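
To make "fully vectorized" concrete, here is a minimal JAX sketch of batched rollouts. It assumes a toy counter environment, not Mahjong, and it is not Mahjax's API: the names init, step, rollout, NUM_ACTIONS, NUM_STEPS and NUM_ENVS are hypothetical. The point is only the pattern: write a pure single-environment step, unroll it with jax.lax.scan, then batch it with jax.vmap and compile it with jax.jit.

```python
import jax
import jax.numpy as jnp

NUM_ACTIONS = 4   # hypothetical action count for the toy environment
NUM_STEPS = 32    # steps per rollout
NUM_ENVS = 1024   # environments simulated in parallel


def init(key):
    """Create one toy environment state: an RNG key, a running total, a step counter."""
    return {"key": key, "total": jnp.int32(0), "t": jnp.int32(0)}


def step(state, action):
    """Advance one toy environment by one step. Pure function: no side effects."""
    key, subkey = jax.random.split(state["key"])
    # Random 0/1 bonus as a stand-in for game stochasticity (e.g. tile draws).
    noise = jax.random.randint(subkey, (), 0, 2, dtype=jnp.int32)
    return {"key": key, "total": state["total"] + action + noise, "t": state["t"] + 1}


def rollout(key):
    """Play NUM_STEPS uniformly random actions in a single environment."""
    init_key, act_key = jax.random.split(key)
    actions = jax.random.randint(act_key, (NUM_STEPS,), 0, NUM_ACTIONS, dtype=jnp.int32)

    def body(state, action):
        return step(state, action), None

    final_state, _ = jax.lax.scan(body, init(init_key), actions)
    return final_state


# vmap lifts the single-environment rollout to a batch; jit compiles the batch once.
batched_rollout = jax.jit(jax.vmap(rollout))

keys = jax.random.split(jax.random.PRNGKey(0), NUM_ENVS)
final = batched_rollout(keys)  # every leaf now has a leading axis of size NUM_ENVS
```

Because step never branches on Python-level state, the same code runs for one environment or many; only the size of the key batch changes. A real Mahjong environment has a far larger state and action masking, which this sketch leaves out.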

[Submitted on 20 May 2026]

Abstract: Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning tabula rasa (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.
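
The abstract measures training progress by rank against baseline policies but does not give the evaluation protocol. As a rough illustration only, the sketch below assumes each game ends with four final scores, that rank 1 goes to the highest score, and that the trained agent sits in seat 0. The function name ranks_from_scores and the score values are made up for the example.

```python
import jax.numpy as jnp


def ranks_from_scores(scores):
    """Map final scores of shape (num_games, 4) to ranks in 1..4 (1 = highest score).

    A seat's rank is 1 plus the number of seats with a strictly higher score,
    so tied seats share the better rank (an assumption of this sketch).
    """
    # higher[g, i, j] is True when seat j outscored seat i in game g.
    higher = scores[:, None, :] > scores[:, :, None]
    return 1 + jnp.sum(higher, axis=-1)


# Made-up final scores for two games; seat 0 is the hypothetical trained agent.
scores = jnp.array([
    [35000, 28000, 22000, 15000],
    [18000, 41000, 30000, 11000],
])

ranks = ranks_from_scores(scores)     # [[1, 2, 3, 4], [3, 1, 2, 4]]
mean_rank = jnp.mean(ranks[:, 0])     # 2.0 for seat 0 over these two games
```

With four players and no ties, the ranks in a game sum to 10, so a mean rank below 2.5 means the agent places better than the table average.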

Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2605.20577 [cs.AI]
	(or arXiv:2605.20577v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.20577 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Soichiro Nishimori
[v1] Wed, 20 May 2026 00:33:28 UTC (217 KB)

— Originally published at arxiv.org
