DeepSeek-V4 introduces two Mixture-of-Experts (MoE) language models, DeepSeek-V4-Pro (1.6T parameters) and DeepSeek-V4-Flash (284B parameters), both supporting a context length of one million tokens. With a hybrid attention architecture, Manifold-Constrained Hyper-Connections, and the Muon optimizer, the series reports state-of-the-art results among open models, and at a one-million-token context DeepSeek-V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache of its predecessor, DeepSeek-V3.2.
arXiv:2606.19348v1 Announce Type: new Abstract: We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models -- DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) -- both supporting a context length of one million tokens.
The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability.
We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios.
In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.
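As a rough reading aid, the headline figures above can be turned into ratios. The sketch below is plain arithmetic on the numbers quoted in the abstract; the helper names (`activated_fraction`, `reduction_factor`) and the dictionary layout are illustrative assumptions, not anything taken from the DeepSeek release.

```python
# Figures quoted in the abstract; parameter counts are in billions.
# The structure and names below are illustrative, not from DeepSeek code.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600.0, "activated_b": 49.0},
    "DeepSeek-V4-Flash": {"total_b": 284.0, "activated_b": 13.0},
}

# DeepSeek-V4-Pro cost relative to DeepSeek-V3.2 at a one-million-token context.
RELATIVE_COST = {"single-token inference FLOPs": 0.27, "KV cache": 0.10}


def activated_fraction(total_b: float, activated_b: float) -> float:
    """Share of a MoE model's parameters that are activated per token."""
    return activated_b / total_b


def reduction_factor(relative_cost: float) -> float:
    """Convert 'needs x of the baseline cost' into 'baseline is 1/x times larger'."""
    return 1.0 / relative_cost


if __name__ == "__main__":
    for name, params in MODELS.items():
        share = activated_fraction(**params)
        print(f"{name}: {share:.1%} of parameters activated per token")
    for metric, cost in RELATIVE_COST.items():
        print(f"{metric}: about {reduction_factor(cost):.1f}x lower than DeepSeek-V3.2")
```

Under these figures, roughly 3% of DeepSeek-V4-Pro's parameters and under 5% of DeepSeek-V4-Flash's are activated per token, and the 27% and 10% costs correspond to reductions of about 3.7x and 10x.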
The REFLECT benchmark reveals that current LLM judges are unreliable, achieving below 55% accuracy in evaluating reasoning and evidence use, highlighting the need for improved evaluation methods for deep research agents.
This study evaluates LLM-based urban simulators like AgentSociety and CitySim, revealing a significant gap between narrative plausibility and real-world mobility realism. Using datasets from Greater Paris and Shanghai, the analysis shows these models struggle with core spatial and temporal constraints, necessitating rigorous empirical validation and improved initialization methods for realistic urban simulations.
The self-generated T2T editing method enhances LLaDA2.1's performance by addressing training-inference mismatches, improving accuracy while reducing edit intensity. This approach involves a no-gradient draft pass and a recovery supervision pass, leading to fewer transcription errors and fewer excessive self-corrections in generated outputs.