3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models
Quick Answer
3D-PLOT-LLM makes the parts of a 3D object directly addressable through the LLM's own vocabulary by reorganizing the input token stream, with under 1M new trainable parameters and no segmentation decoder.
Quick Take
3D-PLOT-LLM adds part-level object tokens to 3D MLLMs. It reaches caption-to-slots Jaccard 0.459 on the new PartVerse-QA benchmark and, on 3DCoMPaT-GrIn, outperforms PointLLM on every text-output metric and ShapeLLM on 3 of 4. It does so with under 1M new trainable parameters and without a segmentation decoder or bounding-box head.
Key Points
- 3D-PLOT-LLM reorganizes input tokens for direct part addressing in 3D MLLMs.
- Reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78% on the PartVerse-QA benchmark.
- Outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric of 3DCoMPaT-GrIn, and ShapeLLM on 3 of 4.
- Adding PartVerse-QA at Stage 2 improves Objaverse whole-object captioning by +0.65 SBERT and +1.85 GPT-4o over PointLLM.
- Uses under 1M new trainable parameters, an order of magnitude below prior part-aware 3D MLLMs.
Article Content
From source RSS / original summary: arXiv:2606.19828v1. Announce Type: new.
Abstract: 3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM's own vocabulary.
Our model, 3D-PLOT-LLM, partitions the frozen point encoder's patches into K locally coherent regions and inserts, before each region's patch tokens, a learnable per-region marker and a reserved vocabulary token; a Marker-Space Refinement (MSR) module then conditions each marker on its region's spatial statistics and adjacency neighbors. The model thus cites parts in its output and follows prompts that refer to parts by token, a capability absent from prior object-level 3D MLLMs.
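The token-stream reorganization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_part_token_stream`, the `<part_k>` token spelling, the tuple representation of tokens, and the choice to skip empty regions are all assumptions made for illustration, and the region assignment is taken as given rather than computed. The Marker-Space Refinement step is not modeled.

```python
from collections import defaultdict


def build_part_token_stream(patch_ids, region_of_patch, num_regions):
    """Place a marker and a reserved vocabulary token before each region's patches.

    patch_ids: identifiers of the point-encoder patches (hypothetical).
    region_of_patch: region index in [0, num_regions) for each patch.
    """
    patches_by_region = defaultdict(list)
    for patch_id, region in zip(patch_ids, region_of_patch):
        patches_by_region[region].append(patch_id)

    stream = []
    for region in range(num_regions):
        members = patches_by_region.get(region, [])
        if not members:
            continue  # assumption: regions with no patches emit nothing
        stream.append(("marker", region))            # learnable per-region marker
        stream.append(("vocab", f"<part_{region}>"))  # reserved vocabulary token
        stream.extend(("patch", p) for p in members)  # the region's patch tokens
    return stream


stream = build_part_token_stream(
    patch_ids=[0, 1, 2, 3, 4],
    region_of_patch=[1, 0, 1, 0, 1],
    num_regions=2,
)
for token in stream:
    print(token)
```

Because each region's patches follow a token that also exists in the LLM vocabulary, a prompt or an answer can refer to a part simply by emitting that token, which is the interface the abstract describes.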
To probe this interface, we construct PartVerse-QA, a vocabulary-level part-QA benchmark adapted from PartVerse mesh annotations (77K training pairs and 588 held-out queries on disjoint object splits), on which 3D-PLOT-LLM reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78%, with a slot-to-caption GPT-4o judge of 44.68.
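For readers unfamiliar with the two set-based scores named above, the sketch below shows the standard definitions of Jaccard overlap and exact match over sets of part tokens. The abstract does not spell out how the benchmark computes them, so the function names, the `<part_k>` token spelling, and the handling of two empty sets are assumptions.

```python
def jaccard(predicted, reference):
    """Size of the intersection divided by size of the union of two token sets."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(pred & ref) / len(pred | ref)


def exact_match(predicted, reference):
    """True only when the predicted set equals the reference set."""
    return set(predicted) == set(reference)


pred = ["<part_0>", "<part_2>"]
gold = ["<part_0>", "<part_1>", "<part_2>"]
score = jaccard(pred, gold)      # 2 shared tokens out of 3 distinct tokens
matched = exact_match(pred, gold)
print(score, matched)
```

Exact match is the stricter of the two, which is consistent with the reported Exact-match figure being much lower than the Jaccard figure.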
On the 3DCoMPaT-GrIn part-aware grounded description benchmark, 3D-PLOT-LLM outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric, and ShapeLLM on 3 of 4, with up to +3.03 GPT-4o judge over PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA at Stage 2 yields +0.65 SBERT and +1.85 GPT-4o over PointLLM, and tops PointLLM-PiSA on 4 of 5 traditional metrics (SBERT, SimCSE, BLEU-1, METEOR) despite targeting a different (part-grounded) objective.
All with under 1M new trainable parameters on a frozen point encoder, an order of magnitude below prior part-aware 3D MLLMs, and no segmentation decoder or bounding-box head.