MIBE: Multi-subject Interaction Benchmark and Evaluator for Personalized Image Generation
Quick Answer
The Multi-subject Interaction Benchmark and Evaluator (MIBE) is a unified benchmark-and-evaluator framework for multi-subject personalized image generation, built to catch errors that existing single-subject metrics miss.
Quick Take
MIBE pairs a benchmark (MIB) with an evaluator (MIE) for multi-subject personalized image generation, where existing single-subject metrics degrade as the subject count grows. The benchmark includes a 60K-pair Silver Set and a 4K-pair Gold Set, with the Silver Set reaching 95.1% cross-VLM preference agreement. MIE reaches 0.922 overall pairwise accuracy against human preference on the Gold Set, outperforming baseline metrics including CLIP and DINO variants.
Key Points
- MIBE features a 60K-pair Silver Set for scalable metric training.
- The Gold Set includes 4K pairs for double-blind human evaluation.
- MIE achieves 0.922 pairwise accuracy against human preferences.
- MIE outperforms baseline metrics like CLIP and DINO.
- The framework targets the failure modes that single-subject metrics miss: omitted subjects, lost reference appearance, and misattributed interactions.
Article Content
From source RSS / original summary (arXiv:2607.01383v1)
Abstract: Multi-subject personalized image generation requires the precise rendering of all requested reference identities and their specified interactions based on a guiding prompt. However, state-of-the-art models still struggle with this process, frequently omitting subjects, failing to preserve reference appearances, or misattributing interactions.
Furthermore, existing metrics designed primarily for single-subject fidelity cannot reliably capture these errors, suffering severe degradation in ranking separability and failing to align with human preference as the subject count increases. To address this gap, we introduce Multi-subject Interaction Benchmark and Evaluator (MIBE), a unified framework comprising a Multi-subject Interaction Benchmark (MIB) and a Multi-subject Interaction Evaluator (MIE).
MIB systematically covers diverse relation types and scene complexities through a decoupled data regime. This consists of a 60K-pair VLM-labeled Silver Set for scalable metric training and a 4K-pair double-blind Human Evaluation Gold Set covering a diverse range of state-of-the-art generators, with the Silver Set reaching 95.1% cross-VLM preference agreement.
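One simple way to read a cross-labeler preference agreement figure is as the fraction of pairs on which every labeler picks the same image. The sketch below assumes that definition. The abstract does not state how the 95.1% figure is computed, and the function name `unanimous_agreement` and the "A"/"B" label encoding are hypothetical.

```python
from typing import Sequence


def unanimous_agreement(labels_per_labeler: Sequence[Sequence[str]]) -> float:
    """Fraction of pairs on which all labelers chose the same image.

    labels_per_labeler[i][j] is labeler i's preference ("A" or "B") for pair j.
    This unanimity definition is an assumption, not the paper's stated formula.
    """
    if len(labels_per_labeler) < 2:
        raise ValueError("need at least two labelers")
    n_pairs = len(labels_per_labeler[0])
    if n_pairs == 0 or any(len(row) != n_pairs for row in labels_per_labeler):
        raise ValueError("each labeler must label the same non-empty set of pairs")
    # zip(*...) regroups the labels so each tuple holds all votes for one pair.
    agreed = sum(1 for votes in zip(*labels_per_labeler) if len(set(votes)) == 1)
    return agreed / n_pairs


# Toy data: two labelers, four pairs, one disagreement on the third pair.
toy_labels = [
    ["A", "B", "A", "A"],
    ["A", "B", "B", "A"],
]
print(unanimous_agreement(toy_labels))  # 0.75
```

With more than two labelers, a pairwise-averaged agreement or a chance-corrected statistic would be a reasonable alternative; which one the authors use is not specified here.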
To demonstrate the utility of this benchmark, we present MIE, a lightweight, reference-conditioned evaluator trained exclusively on the Silver Set with a dual-head ranking and diagnosis objective. MIE exhibits strong cross-generator generalization on the Gold Set, achieving 0.922 overall pairwise accuracy against human preference, including 0.982 on seen generators and 0.884 on unseen generators.
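Pairwise accuracy against human preference can be sketched as the fraction of image pairs where the evaluator scores the human-preferred image higher. This is a minimal illustration, not the paper's evaluation code: the function name `pairwise_accuracy`, the input layout, and the choice to count tied scores as errors are all assumptions.

```python
from typing import Sequence


def pairwise_accuracy(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    human_prefers_a: Sequence[bool],
) -> float:
    """Fraction of pairs where the metric agrees with the human choice.

    scores_a[i] and scores_b[i] are the evaluator's scores for the two images
    in pair i; human_prefers_a[i] is True when annotators preferred image A.
    Tied scores are counted as incorrect (an assumption of this sketch).
    """
    if not (len(scores_a) == len(scores_b) == len(human_prefers_a)):
        raise ValueError("all inputs must have the same length")
    if len(scores_a) == 0:
        raise ValueError("need at least one pair")
    correct = 0
    for score_a, score_b, prefers_a in zip(scores_a, scores_b, human_prefers_a):
        if score_a == score_b:
            continue  # a tie cannot express a preference, so it is not counted
        if (score_a > score_b) == prefers_a:
            correct += 1
    return correct / len(scores_a)


# Toy data: the metric matches humans on pairs 1 and 2, ties on pair 3,
# and disagrees on pair 4.
acc = pairwise_accuracy(
    scores_a=[0.9, 0.2, 0.5, 0.7],
    scores_b=[0.1, 0.8, 0.5, 0.3],
    human_prefers_a=[True, False, True, False],
)
print(acc)  # 0.5
```

Running the same function separately on pairs from seen and unseen generators would give the kind of split the abstract reports.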
By outperforming a broad spectrum of baseline metrics, including CLIP and DINO variants, MIE demonstrates that diagnostic supervision can preserve ranking separability and human alignment where traditional evaluators collapse.