Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention
Quick Answer
FocusDiff is a tuning-free diffusion framework for precise, region-specific image editing that extends to 360-degree indoor panoramas. It targets prompt brittleness, spillover edits, and failures on small or cluttered objects.
Quick Take
On the localized editing benchmark LIMB, FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation. The authors also extend it to 360-degree indoor panoramas and demonstrate it in virtual reality environments.
Key Points
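- FocusDiff uses refocusing cross-attention, selectively blurring non-edit areas to steer attention toward the masked target region.
- Context-preserving modules maintain background fidelity and global coherence, so edits work from simple text prompts in a single pass.
- The method extends to 360-degree indoor panorama editing and is demonstrated in virtual reality environments.
- Experiments use the LIMB benchmark: 30 multi-object images and 100 annotated examples, including challenging small-object cases.
- It outperforms existing zero-shot editors in text-image alignment and background preservation.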
Article Content
From source RSS / original summary: arXiv:2606.14035v1. Abstract: Zero-shot text-guided diffusion has significantly advanced image editing; however, its practical usability remains constrained by three persistent challenges: prompt brittleness that requires meticulous prompt engineering, spillover edits that unintentionally affect non-target regions, and failures on small or cluttered objects caused by limited fine-grained supervision in training data.
We propose FocusDiff (Target-Aware Refocusing for Tuning-Free Diffusion Editing), a tuning-free framework for precise and region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while accurately transferring the object's identity, structure, and appearance to the edited output.
Integrated context-preserving modules further ensure background fidelity and global coherence, enabling accurate edits from simple text prompts in a single pass. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments.
Extensive experiments on our localized editing benchmark LIMB, comprising 30 multi-object images and 100 annotated examples including challenging small-object cases, show that FocusDiff outperforms existing zero-shot editors in text-image alignment and background preservation, achieving superior precision, photorealism, and usability. The project page is available at https://vdkhoi20.github.io/FocusDiff.
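The abstract says FocusDiff "applies selective blurring to non-edit areas" so that attention concentrates on the masked region. As a rough illustration of that one idea, and not the authors' implementation, the sketch below blurs every pixel outside a binary mask and leaves the masked pixels untouched. Everything here is an assumption made for illustration: the function names `box_blur` and `blur_outside_mask`, the box filter, the `radius` parameter, and the use of a grayscale image in pixel space (the paper may well blur a different representation, with a different kernel). Plain Python lists keep the example dependency-free.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of floats (illustrative type)
Mask = List[List[bool]]    # True marks the edit (target) region


def box_blur(image: Image, radius: int) -> Image:
    """Mean filter over a (2*radius+1)-wide window, truncated at the image borders."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            # Average only the neighbours that fall inside the image.
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += image[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def blur_outside_mask(image: Image, mask: Mask, radius: int = 1) -> Image:
    """Keep masked pixels sharp; replace every other pixel with its blurred value."""
    blurred = box_blur(image, radius)
    return [
        [image[y][x] if mask[y][x] else blurred[y][x] for x in range(len(image[0]))]
        for y in range(len(image))
    ]
```

Calling `blur_outside_mask(image, mask, radius=2)` returns an image whose target region is identical to the input while the surroundings lose detail. How FocusDiff feeds such a signal into cross-attention is not described in the abstract and is not modeled here.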