Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts
Quick Answer
Visual-SDPO improves code-generated visual artifacts (charts, web pages, slides) by distilling rendered visual feedback back into the coding model. On ChartMimic, Design2Code, and AeSlides it beats the zero-shot base by more than 10 absolute points on the primary metric and GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.
Key Points
- Visual-SDPO improves visual artifact generation by treating rendered feedback as privileged context.
- Introduces Visual-Grounded Code Credit Weighting to target specific code statements for defect correction.
- Improves over the zero-shot base by more than 10 absolute points on the primary metric, and over GRPO by at least 2.4 points, with fewer training steps.
- Uses a single Qwen3-VL-8B-Instruct backbone for chart, web/UI, and slide generation.
- Keeps failed executions learnable by passing execution errors to the teacher as privileged context.
Article Content
From source RSS / original summary (arXiv:2606.10334v1)
Abstract: Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow.
We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student.
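As a rough illustration of the distillation idea, the sketch below computes a per-token divergence between a teacher distribution (which, in the paper's setup, would be conditioned on the rendered feedback) and a student distribution that does not see it. The abstract does not state the exact objective, so the choice of KL(teacher || student) and the function names `softmax`, `token_kl`, and `distill_losses` are assumptions made for this example only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_kl(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) at one position (assumed divergence, not confirmed by the source)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_losses(
    teacher_seq: List[List[float]], student_seq: List[List[float]]
) -> List[float]:
    """One dense loss value per generated code token."""
    return [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
```

Because the teacher shares weights with the student, only the conditioning context differs between the two forward passes in this picture; the student pays nothing extra at inference time since the feedback is used during training only.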
To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements.
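One way to picture this weighting is as a per-token multiplier on the distillation loss. The sketch below is a minimal illustration, not the paper's implementation: the mapping from rendered elements to statements (`element_to_stmts`), the token-to-statement index (`token_to_stmt`), the constant `boost` factor, and the normalization by total weight are all hypothetical choices.

```python
from typing import Dict, List


def credit_weights(
    token_to_stmt: List[int],
    element_to_stmts: Dict[str, List[int]],
    defects: List[List[str]],
    boost: float = 2.0,
) -> List[float]:
    """Return one weight per token: `boost` on blamed statements, 1.0 elsewhere.

    Each defect is given as the list of rendered element ids it affects.
    """
    blamed = set()
    for affected_elements in defects:
        for element in affected_elements:
            blamed.update(element_to_stmts.get(element, []))
    return [boost if stmt in blamed else 1.0 for stmt in token_to_stmt]


def weighted_distill_loss(per_token_loss: List[float], weights: List[float]) -> float:
    """Weight-normalized average of the per-token distillation losses."""
    total = sum(w * l for w, l in zip(weights, per_token_loss))
    return total / sum(weights)


# Example: a clipped title traced back to statement 1 of a three-statement script.
weights = credit_weights(
    token_to_stmt=[0, 0, 1, 1, 2],
    element_to_stmts={"title": [1]},
    defects=[["title"]],
)
```

With no detected defects every weight stays at 1.0, so the objective falls back to uniform token-level distillation.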
A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone.
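The sequence-level term can be sketched with the usual group-relative advantage used in GRPO: each rollout's reward is standardized against the other rollouts sampled for the same prompt. The reward shaping here (zero for code that fails to execute, a visual-quality score otherwise), the use of population standard deviation, and the names `rollout_reward` and `group_relative_advantages` are assumptions for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev
from typing import List


def rollout_reward(executed: bool, visual_score: float) -> float:
    """Hypothetical reward: visual quality if the code ran, else zero."""
    return visual_score if executed else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example group: one failed execution and two successful renders.
rewards = [
    rollout_reward(False, 0.9),
    rollout_reward(True, 0.6),
    rollout_reward(True, 0.8),
]
advantages = group_relative_advantages(rewards)
```

In this sketch a failed rollout only receives a low sequence-level advantage; per the abstract, it still contributes a dense learning signal through the self-distillation path, where the execution error is given to the teacher as privileged context.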
Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.