ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

arXiv cs.CV·Anindya Mondal, Sauradip Nag, Anjan Dutta

6/24/2026

Quick Answer

ABACUS is a unified vision-language model that excels in object and crowd counting, as well as count-faithful image generation, achieving state-of-the-art results across seven benchmarks without benchmark-specific training.

Quick Take

ABACUS adapts an existing 3B-parameter unified foundation model to counting and count-faithful generation. Its main additions are density-aware adaptive zooming, a boundary-aware count policy, and a cycle-consistent GRPO strategy; with these it outperforms both task-specific specialists and larger generalist models.

Key Points

ABACUS is built on an existing 3B-parameter unified foundation model.
Innovations include density-aware adaptive zooming with objectness maps and a boundary-aware count policy trained via GRPO.
Achieves state-of-the-art results across seven benchmarks.
Outperforms both task-specific specialists and larger generalist models.
A cycle-consistent GRPO strategy closes the understanding-generation gap without external annotations.
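The zooming and crop-boundary ideas in the points above can be illustrated with a minimal sketch. This summary does not describe how the paper implements them, so everything here is an assumption for illustration: the function names, the fixed grid, and the mean-objectness threshold are hypothetical, and the half-open tile assignment is a simple stand-in for the learned boundary-aware policy, not the authors' method.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


def tile_boxes(height: int, width: int, rows: int, cols: int) -> List[Box]:
    """Split an image of the given size into a rows x cols grid of crops."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            r0, r1 = r * height // rows, (r + 1) * height // rows
            c0, c1 = c * width // cols, (c + 1) * width // cols
            boxes.append((r0, c0, r1, c1))
    return boxes


def tiles_to_zoom(objectness: List[List[float]], rows: int, cols: int,
                  density_threshold: float) -> List[Box]:
    """Return the crops whose mean objectness exceeds the threshold.

    Assumption: dense regions are the ones worth re-examining at higher
    resolution; the threshold is a hypothetical hyperparameter.
    """
    height, width = len(objectness), len(objectness[0])
    selected = []
    for (r0, c0, r1, c1) in tile_boxes(height, width, rows, cols):
        area = (r1 - r0) * (c1 - c0)
        if area == 0:
            continue  # grid finer than the map; nothing to measure
        total = sum(objectness[r][c]
                    for r in range(r0, r1) for c in range(c0, c1))
        if total / area > density_threshold:
            selected.append((r0, c0, r1, c1))
    return selected


def count_per_crop(centers: List[Tuple[int, int]],
                   boxes: List[Box]) -> List[int]:
    """Count object centres per crop using half-open intervals.

    A centre on a shared edge belongs to exactly one crop, so an object
    straddling a crop boundary is not counted twice.
    """
    counts = [0] * len(boxes)
    for (r, c) in centers:
        for i, (r0, c0, r1, c1) in enumerate(boxes):
            if r0 <= r < r1 and c0 <= c < c1:
                counts[i] += 1
                break
    return counts
```

With a 2 x 2 grid over a small objectness map, only the dense quadrant is selected for zooming, and the per-crop counts sum to the number of object centres regardless of where the crop edges fall.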

Article Excerpt

From source RSS / original summary

arXiv:2606.23835v1. Abstract: ABACUS is a unified model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without any benchmark-specific training.

Our model is built on an existing 3B-parameter unified foundation model and is adapted for object localization tasks using three key innovations: density-aware adaptive zooming with objectness maps for spatial grounding; a boundary-aware count policy via GRPO to eliminate crop-boundary errors; and a cycle-consistent GRPO strategy in which the understanding branch self-critiques generated outputs, closing the understanding-generation gap without any external annotations.
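As a rough illustration of the cycle-consistent idea, the sketch below scores a group of generated images by how closely the count reported by the understanding branch matches the requested count, then converts those scores into GRPO-style group-relative advantages. The reward shape and the function names are assumptions made for this example, since the abstract does not specify them; only the within-group standardisation of rewards is the generic GRPO recipe.

```python
from statistics import mean, pstdev
from typing import List


def cycle_reward(target_count: int, counted: int) -> float:
    """Hypothetical reward: 1.0 for an exact match, decaying linearly
    with the relative count error and clipped at 0.0."""
    error = abs(counted - target_count) / max(target_count, 1)
    return max(0.0, 1.0 - error)


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardise rewards within one sampled group (GRPO-style).

    Uses the population standard deviation; implementations differ on
    this detail. eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Prompt asks for 5 objects; the understanding branch counts the objects
# in each of four sampled generations (values are made up for the demo).
counts_from_understanding = [5, 4, 7, 5]
rewards = [cycle_reward(5, c) for c in counts_from_understanding]
advantages = group_relative_advantages(rewards)
```

Generations whose counted objects match the prompt receive positive advantages and the others negative ones, so the training signal comes from the model's own counting ability rather than from external labels.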

ABACUS achieves state-of-the-art results across seven benchmarks, outperforming both task-specific specialists and larger generalist models.
