CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis
Quick Answer
CaricHarmony introduces a training-free method for caricature synthesis, resolving identity-shape conflicts through parallel diffusion paths, achieving a 0.8615 shape CLIP score and generating outputs in under 16 seconds.
Quick Take
According to the abstract, the approach gives finer creative control over exaggeration while keeping the subject recognizable, and it outperforms prior methods such as DemoCaricature and CaricatureBooth on shape fidelity and user preference.
Key Points
- CaricHarmony uses three diffusion paths: identity, shape, and harmonized output.
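- Energy functions on cross-attention features supply gradient guidance that steers the harmonized path toward a balance of identity and shape.
- Achieves a shape CLIP score of 0.8615 versus 0.8450 for the compared baseline, at a comparable identity consistency score.
- Generates caricatures in under 16 seconds with no per-identity fine-tuning; DemoCaricature, by contrast, needs 70 seconds of fine-tuning per identity.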
- Redefines ID-shape conflict as conditioning signal contamination in diffusion models.
Article Content
From source RSS / original summary
arXiv:2606.13964v1 Announce Type: new
Abstract: Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in diffusion models, they create destructive interference that causes inevitable collapse toward either bland portraits or unrecognizable distortions. We identify the root cause as "condition signal contamination": competing probability distributions in the denoising trajectory that make balanced generation impossible.
We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: $\mathcal{P}^{\mathrm{i}}$ (pure identity), $\mathcal{P}^{\mathrm{s}}$ (pure shape), and $\mathcal{P}^{\mathrm{i+s}}$ (harmonized output).
Novel energy functions operating on cross-attention features provide gradient guidance that steers $\mathcal{P}^{\mathrm{i+s}}$ toward optimal balance: $\mathcal{E}_{\mathrm{shape}}$ ensures sketch fidelity through layout and semantic alignment, while $\mathcal{E}_{\mathrm{id}}$ employs token-level correspondence matching robust to extreme distortions.
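The three-path idea with energy guidance can be sketched in a toy setting. This is a minimal illustration, not the paper's implementation: the linear map `W` stands in for cross-attention features, `toy_denoise` stands in for a diffusion denoising step, the quadratic energies stand in for $\mathcal{E}_{\mathrm{id}}$ and $\mathcal{E}_{\mathrm{shape}}$, and the 0.9/0.1 "contaminated" joint condition is an assumption chosen to mimic collapse toward identity. All names and parameters are hypothetical.

```python
import numpy as np

DIM = 8
_rng = np.random.default_rng(0)
# Hypothetical stand-in for a cross-attention feature extractor: a fixed
# linear map, scaled to spectral norm 1 so the guidance step stays stable.
W = _rng.normal(size=(DIM, DIM))
W /= np.linalg.norm(W, 2)


def features(z):
    """Toy 'cross-attention features' of a latent."""
    return W @ z


def toy_denoise(z, target, rate=0.1):
    """Toy denoising step: move the latent a little toward its condition."""
    return z + rate * (target - z)


def energy_and_grad(z, ref_feat):
    """Quadratic energy 0.5 * ||f(z) - ref||^2 and its gradient w.r.t. z."""
    diff = features(z) - ref_feat
    return 0.5 * float(diff @ diff), W.T @ diff


def run_paths(guidance=0.25, steps=50, seed=1):
    """Run the three toy paths; return the final total energy of the joint path."""
    rng = np.random.default_rng(seed)
    id_target = rng.normal(size=DIM)     # assumed identity condition
    shape_target = rng.normal(size=DIM)  # assumed shape (sketch) condition
    # Assumption: the naive joint condition is contaminated and leans
    # heavily toward identity.
    joint_target = 0.9 * id_target + 0.1 * shape_target

    z_i = z_s = z_is = rng.normal(size=DIM)  # shared initial noise
    for _ in range(steps):
        z_i = toy_denoise(z_i, id_target)       # pure identity path
        z_s = toy_denoise(z_s, shape_target)    # pure shape path
        z_is = toy_denoise(z_is, joint_target)  # combined path
        # Energies compare the combined path with each pure path's features.
        _, g_id = energy_and_grad(z_is, features(z_i))
        _, g_shape = energy_and_grad(z_is, features(z_s))
        # Gradient guidance: step the combined latent down both energies.
        z_is = z_is - guidance * (g_id + g_shape)

    e_id, _ = energy_and_grad(z_is, features(z_i))
    e_shape, _ = energy_and_grad(z_is, features(z_s))
    return e_id + e_shape


if __name__ == "__main__":
    print("total energy without guidance:", run_paths(guidance=0.0))
    print("total energy with guidance:   ", run_paths(guidance=0.25))
```

In this toy, the guided run ends with a lower combined energy than the unguided one, meaning the combined latent sits closer to both pure paths in feature space. The real method operates on a diffusion model's cross-attention features, with layout, semantic, and token-level matching terms that this sketch does not attempt to reproduce.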
Unlike DemoCaricature requiring 70 seconds per-identity fine-tuning or CaricatureBooth constrained to Bezier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: 0.8615 shape CLIP score (vs. 0.8450) under comparable identity consistency score, with 7.81 overall user preference score (vs. 6.06).
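To put the reported margins in perspective, the short calculation below turns the quoted figures into absolute and relative differences. It uses only the four numbers from the abstract. The helper `gain` is a hypothetical name, and the abstract states neither the scale of the user preference score nor which baseline each comparison figure belongs to, so the percentages are simple ratios rather than a statistical claim.

```python
def gain(ours, baseline):
    """Return (absolute difference, relative difference in percent)."""
    diff = ours - baseline
    return diff, 100.0 * diff / baseline


# Figures as quoted in the abstract.
shape_abs, shape_rel = gain(0.8615, 0.8450)
pref_abs, pref_rel = gain(7.81, 6.06)

print(f"shape CLIP score: +{shape_abs:.4f} ({shape_rel:.2f}% relative)")
print(f"user preference:  +{pref_abs:.2f} ({pref_rel:.1f}% relative)")
```

The shape CLIP margin is small in relative terms (about 2%), while the user preference margin is much larger (about 29%). The 16-second and 70-second timings are not directly comparable in this way, since the 70 seconds refers to per-identity fine-tuning rather than generation.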
Our method fundamentally reconceptualizes the ID-shape conflict as conditioning signal contamination for diffusion models, enabling unprecedented creative control while preserving recognition.