DIVER:Diving Deeper into Distilled Data via Expressive Semantic Recovery
Quick Take
DIVER introduces a dual-stage distillation framework enhancing semantic recovery for improved dataset distillation.
Key Points
- Utilizes pre-trained diffusion models for deeper semantic analysis.
- Improves cross-architecture generalization with efficient processing.
- Code available on GitHub for further exploration.
Abstract: Dataset distillation aims to synthesize a compact proxy dataset that is unreadable or non-raw with respect to the original dataset, serving privacy protection and highly efficient learning. However, previous approaches typically adopt a single-stage distillation paradigm, which learns specific patterns that overfit to a prior architecture, consequently suppressing the expression of semantics and degrading performance across heterogeneous architectures. To address this issue, we propose a novel dual-stage distillation framework called ${\textbf{DIVER}}$, which leverages a pre-trained diffusion model to dive deeper into $\textbf{DI}$stilled data $\textbf{V}$ia $\textbf{E}$xpressive semantic $\textbf{R}$ecovery, an entire process of semantic inheritance, guidance, and fusion. Semantic inheritance distills the high-level semantics of abstract distilled images into the latent space to filter out architecture-specific "noise" and retain the intrinsic semantics. Furthermore, semantic guidance improves the preservation of the original semantics by directing the reverse procedure. Finally, semantic fusion provides semantic guidance only during the concrete phase of the reverse process, preventing semantic ambiguity and artifacts while retaining the guidance information. Extensive experiments validate the effectiveness and efficiency of DIVER in improving classical distillation techniques and significantly improving cross-architecture generalization, with processing time comparable to raw DiT on ImageNet (256$\times$256) and only 4 GB of GPU memory usage. Code is available: this https URL.
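The semantic-fusion idea described above, injecting guidance only during the final "concrete" phase of the reverse diffusion process, can be illustrated with a toy sketch. This is not the authors' code: the denoiser, the guidance signal, and the `concrete_frac` parameter below are all hypothetical placeholders standing in for DIVER's actual diffusion model and semantic guidance.

```python
import numpy as np

def denoise(z, t):
    """Hypothetical denoiser: a placeholder that predicts a small residual."""
    return z * 0.1

def semantic_gradient(z, target):
    """Hypothetical semantic guidance signal pulling the latent toward
    a target semantic latent (stand-in for DIVER's inherited semantics)."""
    return target - z

def reverse_process(z_T, target, steps=50, concrete_frac=0.3, scale=0.2):
    """Toy reverse process: guidance is fused only in the last
    `concrete_frac` of the steps (the 'concrete phase'), so the early,
    abstract phase runs unguided and avoids semantic ambiguity."""
    z = z_T.copy()
    for t in range(steps, 0, -1):
        eps = denoise(z, t)
        if t <= concrete_frac * steps:  # concrete phase: fuse guidance
            eps = eps - scale * semantic_gradient(z, target)
        z = z - eps  # one simplified reverse step
    return z
```

Under these placeholder dynamics, the unguided phase shrinks the latent gradually, while the guided concrete phase pulls it more strongly toward the target semantics; the real method replaces these toys with a pre-trained diffusion model and latent-space semantic guidance.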
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2605.12649 [cs.CV] |
| (or arXiv:2605.12649v1 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2605.12649 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Qianxin Xia [view email]
[v1]
Tue, 12 May 2026 18:55:53 UTC (16,215 KB)
— Originally published at arxiv.org