ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation
Quick Answer
ABACUS is a unified vision-language model that excels in object and crowd counting, as well as count-faithful image generation, achieving state-of-the-art results across seven benchmarks without benchmark-specific training.
Quick Take
ABACUS adapts a unified foundation model with innovations such as density-aware adaptive zooming and a cycle-consistent GRPO strategy, and it outperforms both task-specific models and larger generalist models.
Key Points
- ABACUS is built on an existing 3B-parameter unified foundation model.
- Innovations include density-aware adaptive zooming and boundary-aware count policy.
- Achieves state-of-the-art results across seven benchmarks.
- Outperforms both task-specific specialists and larger generalist models.
- Its cycle-consistent GRPO strategy closes the understanding-generation gap without external annotations.
Article Excerpt
From source RSS / original summary
arXiv:2606.23835v1 Announce Type: new
Abstract: ABACUS is a unified model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without any benchmark-specific training.
Our model is built on an existing 3B-parameter unified foundation model and is adapted for object localization tasks using three key innovations: density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy via GRPO to eliminate crop-boundary errors; and a cycle-consistent GRPO strategy where the understanding branch self-critiques generated outputs, closing the understanding-generation gap without any external annotations.
ABACUS achieves state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.
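The cycle-consistent GRPO idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract gives no reward definition or training details, so the `count_reward` function, the standard-deviation normalization, and all names below are assumptions chosen for illustration. The sketch shows only the general pattern: several images are generated for one requested count, the understanding branch counts the objects in each, and each sample is scored relative to its group.

```python
from statistics import mean, pstdev


def count_reward(target: int, predicted: int) -> float:
    """Hypothetical reward: 1.0 for an exact count, decaying linearly
    with relative error and clipped at 0.0 (an assumption, not from the paper)."""
    return max(0.0, 1.0 - abs(predicted - target) / max(target, 1))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: each reward is normalized against the mean and
    standard deviation of its own sampling group, so no learned critic is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def cycle_consistent_advantages(target: int, counted: list[int]):
    """`counted` holds the counts that the understanding branch reports for a
    group of images generated from a prompt requesting `target` objects."""
    rewards = [count_reward(target, c) for c in counted]
    return rewards, group_relative_advantages(rewards)


if __name__ == "__main__":
    # Four hypothetical generations for a prompt asking for 5 objects.
    rewards, advantages = cycle_consistent_advantages(5, [5, 4, 7, 5])
    for r, a in zip(rewards, advantages):
        print(f"reward={r:.2f} advantage={a:+.3f}")
```

In this sketch, samples whose self-critiqued count matches the request receive positive advantages and the others receive negative ones, which is how a generation policy could be pushed toward count-faithful outputs without external labels.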