Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes

arXiv cs.CV·Wenxuan Peng, Bharath Hariharan, Hadar Averbuch-Elor

5/25/2026

Quick Take

The study introduces a dual pose-image representation to enhance multi-person interaction scene generation in text-to-image models. By integrating structural priors into pretrained diffusion transformers, the model improves prompt alignment and scene diversity, addressing issues of repetitive layouts and stereotypical poses.

Key Points

Introduces a dual pose-image representation for improved scene generation.
Enhances prompt alignment and diversity in multi-person interactions.
Employs a cross-modal alignment scheme for consistent grounding.
Uses an iterative scene construction scheme to build complex multi-human interactions progressively.
Reports substantial improvements in prompt alignment and scene diversity in the authors' experiments.

Article Excerpt

From source RSS / original summary

arXiv:2605.23178v1 (announce type: new). Abstract: Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers.

Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity.
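
The control flow of an iterative construction scheme like the one described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how a scene is decomposed, so the sketch assumes one person is added per step, and every name in it (`SceneState`, `toy_joint_step`, `build_scene`) is hypothetical. The diffusion model is replaced by a toy function that returns strings, so that only the loop structure is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SceneState:
    """Running scene: one pose entry and one RGB entry per person added so far."""
    poses: List[str] = field(default_factory=list)
    images: List[str] = field(default_factory=list)


# A step takes (sub-prompt, current scene) and returns a (pose, image) pair.
StepFn = Callable[[str, SceneState], Tuple[str, str]]


def toy_joint_step(sub_prompt: str, scene: SceneState) -> Tuple[str, str]:
    """Hypothetical stand-in for a joint pose-and-RGB predictor.

    A real model would emit a 2D pose visualization and an RGB image together;
    here both outputs are strings derived from the same sub-prompt.
    """
    idx = len(scene.poses)
    pose = f"pose[{idx}]: {sub_prompt}"
    image = f"rgb[{idx}]: {sub_prompt} | sees {idx} earlier person(s)"
    return pose, image


def build_scene(sub_prompts: List[str], step: StepFn = toy_joint_step) -> SceneState:
    """Add one person per step (an assumption), conditioning on the scene so far."""
    scene = SceneState()
    for sub_prompt in sub_prompts:
        pose, image = step(sub_prompt, scene)  # step sees all earlier people
        scene.poses.append(pose)
        scene.images.append(image)
    return scene
```

Calling `build_scene(["a woman reaching out her hand", "a man shaking her hand"])` returns a `SceneState` with two pose entries and two image entries, where the second step was given the first person's outputs as context. Splitting the prompt into per-person sub-prompts is itself an assumption of this sketch.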

Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.

