Brick-Composer: Using MLLMs for Assembly with Diverse Bricks
Quick Answer
This paper shows that the Brick-Composer framework improves MLLMs' assembly skills: brick selection accuracy rises more than threefold, and strict step-level assembly success rises from under 1% to around 15%.
Quick Take
Brick-Composer trains MLLMs for brick assembly with three complementary signals: human construction demonstrations, feedback on the visual and physical consequences of predicted actions, and synthetic experience. With BC-Bench used for evaluation, a trained Qwen-3-8B correctly composes up to 42% of the steps for a complete object, which suggests that targeted, physically grounded learning can teach assembly skills.
Key Points
- Brick assembly is framed as a sequential decision-making problem in which each step has two subtasks: brick selection and brick pose estimation.
- Current MLLMs struggle with fine-grained brick selection and precise pose estimation.
- Brick-Composer integrates Human Design Sparks, World Feedback, and Synthetic Experience for training.
- After training, Qwen-3-8B can correctly compose up to 42% of the steps for a complete object.
- BC-Bench is introduced as the first benchmark for evaluating MLLMs in brick assembly.
Article Content
From the source RSS (original summary), arXiv:2606.05445v1. Abstract: We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly.
We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks.
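The formulation above can be sketched in Python as a loop over steps, where each step picks one brick from the remaining candidates and assigns it a pose. This is only an illustrative sketch, not the paper's implementation: the names `Pose`, `Action`, `rollout`, and `first_brick_policy`, the grid-position-plus-rotation pose encoding, and the rule that each candidate is used once are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Pose:
    """Hypothetical pose encoding: a grid position plus a rotation in degrees."""
    position: Tuple[int, int, int]
    rotation_deg: int


@dataclass(frozen=True)
class Action:
    """One assembly step: which brick was selected and where it is placed."""
    brick_id: str
    pose: Pose


# A policy sees the bricks placed so far and the remaining candidates,
# and returns the next action (selection + pose).
Policy = Callable[[Sequence[Action], Sequence[str]], Action]


def rollout(policy: Policy, candidates: Sequence[str], num_steps: int) -> List[Action]:
    """Run the sequential decision process for a fixed number of steps."""
    placed: List[Action] = []
    remaining = list(candidates)
    for _ in range(num_steps):
        action = policy(placed, remaining)
        # Subtask 1 (brick selection) must pick one of the candidates.
        if action.brick_id not in remaining:
            raise ValueError(f"brick {action.brick_id!r} is not a candidate")
        # Assumption for this sketch: each candidate brick is used at most once.
        remaining.remove(action.brick_id)
        # Subtask 2 (pose estimation) is carried in action.pose.
        placed.append(action)
    return placed


def first_brick_policy(placed: Sequence[Action], remaining: Sequence[str]) -> Action:
    """Toy stand-in for an MLLM: take the first candidate and stack it upward."""
    return Action(brick_id=remaining[0], pose=Pose(position=(0, 0, len(placed)), rotation_deg=0))
```

Calling `rollout(first_brick_policy, ["2x4", "1x2"], num_steps=2)` returns two actions stacked at heights 0 and 1. In a real setup the policy would be the MLLM itself, conditioned on the target design and rendered views rather than on identifiers alone.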
Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation.
To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs.
Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.
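To see why strict step-level success can sit well below selection accuracy, here is a minimal Python sketch of how the two numbers might be computed. It assumes a step counts as strictly successful only when both the selected brick and its pose match the reference exactly; the abstract does not state the actual matching rule or any pose tolerance, so `score_steps`, the `(brick_id, pose)` tuple layout, and the exact-match criterion are hypothetical.

```python
from typing import Dict, Hashable, Sequence, Tuple

# Hypothetical step encoding: (brick_id, pose), where pose is any hashable value.
Step = Tuple[str, Hashable]


def score_steps(predicted: Sequence[Step], reference: Sequence[Step]) -> Dict[str, float]:
    """Compare predicted steps against reference steps, position by position."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(reference)
    # Selection is correct when the brick identifiers match.
    selection_hits = sum(p[0] == r[0] for p, r in zip(predicted, reference))
    # Strict success (assumed here) requires the brick AND the pose to match.
    strict_hits = sum(p == r for p, r in zip(predicted, reference))
    return {
        "selection_accuracy": selection_hits / n,
        "strict_step_success": strict_hits / n,
    }


# Poses below are made-up (x, y, z, rotation_deg) tuples for illustration.
predicted = [("2x4", (0, 0, 0, 0)), ("1x2", (1, 0, 1, 90)), ("2x2", (0, 0, 2, 0)), ("1x1", (0, 0, 3, 0))]
reference = [("2x4", (0, 0, 0, 0)), ("1x2", (1, 0, 1, 0)), ("2x2", (0, 0, 2, 0)), ("1x4", (0, 0, 3, 0))]
scores = score_steps(predicted, reference)
```

In this toy data three of four bricks are selected correctly, but one of those has the wrong rotation, so only two of four steps pass the strict check. Any pose error removes a step from the strict count even when the brick choice was right.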