Nemotron 3 Ultra is a Mixture-of-Experts model with 550 billion total and 55 billion active parameters. It achieves up to ~6x higher inference throughput than state-of-the-art publicly available LLMs while attaining on-par accuracy. It supports a context length of 1 million tokens, making it suited to long-running autonomous agentic tasks. The base, post-trained, and quantized checkpoints are open-sourced on HuggingFace, along with the training data and recipe.
arXiv:2606.15007v1
Abstract: We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD).
Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks.
We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.
The REFLECT benchmark reveals that current LLM judges are unreliable: they achieve below 55% accuracy in evaluating reasoning and evidence use, which highlights the need for improved evaluation methods for deep research agents.
This study evaluates LLM-based urban simulators like AgentSociety and CitySim, revealing a significant gap between narrative plausibility and real-world mobility realism. Using datasets from Greater Paris and Shanghai, the analysis shows these models struggle with core spatial and temporal constraints, necessitating rigorous empirical validation and improved initialization methods for realistic urban simulations.
The QIAS 2026 shared task evaluates large language models' reasoning in Islamic inheritance, utilizing the MAWARITH dataset of 12,500 annotated cases. Sixteen teams participated, revealing significant challenges in legal interpretation and numerical reasoning, with results indicating current models struggle with complex inheritance calculations.