HANCLIP: A Family of Hyperbolic Angular Negation Vision Language Models
Quick Answer
HANCLIP introduces a new family of Vision-Language Models that enhance negation sensitivity by restructuring the embedding space.
Quick Take
HANCLIP introduces a new family of Vision-Language Models (VLMs) that enhance negation sensitivity by restructuring the embedding space. Trained on 20,000 image-text quadruplets, it improves results on the negation-focused NegBench benchmark while staying competitive on standard classification and retrieval tasks. The framework is model-agnostic and can be plugged into existing models such as CLIP without large-scale retraining.
Key Points
- HANCLIP explicitly encodes 'what an image is not' alongside 'what it is'.
- Utilizes a hyperbolic formulation to model hierarchical semantic relations.
- Achieves consistent gains on the NegBench benchmark focused on negation.
- Maintains competitive performance on standard classification and retrieval tasks.
- Can be integrated into existing models like CLIP without large-scale retraining.
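To make the "hyperbolic formulation" point concrete, here is a minimal sketch of geodesic distance in the Poincaré ball, one common model of hyperbolic space. This is an assumption for illustration: the summary does not say which hyperbolic model or curvature HANCLIP uses, and the function name `poincare_distance` is hypothetical.

```python
import math

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Illustrative only: unit curvature and the Poincare model are assumptions,
    not details taken from the HANCLIP paper.
    """
    sq_u = sum(x * x for x in u)                       # squared norm of u
    sq_v = sum(x * x for x in v)                       # squared norm of v
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))  # squared Euclidean gap
    denom = max((1.0 - sq_u) * (1.0 - sq_v), eps)      # guard against the boundary
    return math.acosh(1.0 + 2.0 * sq_diff / denom)

# The same Euclidean gap (0.1) costs more distance near the boundary
# than near the origin.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
print(near_origin < near_boundary)
```

Because distances grow rapidly toward the boundary, general concepts can sit near the origin and specific ones farther out, which is why hyperbolic spaces are often used for hierarchical relations.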
Abstract: Vision-Language Models (VLMs) are typically pre-trained on large-scale image-text datasets to capture semantic correspondences between visual content and natural language. However, they remain surprisingly brittle to negation: models often rely on shallow word co-occurrence and are easily distracted by misleading or irrelevant textual cues, even when their overall retrieval or classification performance is strong. Moreover, directly finetuning on negation data can interfere with previously acquired knowledge, causing noticeable degradation on standard vision-language benchmarks. To tackle these issues, this work introduces HANCLIP (Hyperbolic + Angular + Negation), a family of VLMs that explicitly restructures the embedding space to encode "what an image is not" alongside "what it is." HANCLIP is trained on a compact set of 20,000 image-text quadruplets and combines a hyperbolic formulation, which models hierarchical semantic relations and asymmetries, with an angular triplet objective that drives systematic separation between negated descriptions and their corresponding positives. This geometry-aware design strengthens negation sensitivity while preserving the global structure of pretrained representations, rather than overwriting them. Extensive experiments across multiple vision-language tasks show that HANCLIP delivers consistent gains on the negation-focused NegBench benchmark, while maintaining competitive or improved performance on standard classification and image-text retrieval benchmarks. The framework is model-agnostic and can be plugged into CLIP, LongCLIP, SmartCLIP, and HiMo-CLIP without large-scale retraining, demonstrating that a carefully designed geometric objective can substantially extend the reasoning capabilities of existing VLMs using only modest additional data.
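The abstract does not give the form of the angular triplet objective. As a rough sketch of the general idea, the snippet below implements a generic angular-margin triplet loss: it asks the angle between an image embedding and its correct caption to be smaller, by at least a margin, than the angle to the negated caption. The function names, the Euclidean angle, and the margin value are all assumptions for illustration, not the paper's definition.

```python
import math

def angle_between(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # clamp for acos
    return math.acos(cos)

def angular_triplet_loss(image, positive, negated, margin=0.5):
    """Hinge loss on angles: zero once the negated caption is at least
    `margin` radians farther from the image than the positive caption.

    Hypothetical formulation; the margin of 0.5 is an arbitrary choice.
    """
    pos_angle = angle_between(image, positive)
    neg_angle = angle_between(image, negated)
    return max(0.0, pos_angle - neg_angle + margin)

image = [1.0, 0.0]
positive = [1.0, 0.1]        # e.g. "a dog on grass"
well_separated = [0.0, 1.0]  # negated caption already far away in angle
too_close = [1.0, 0.2]       # negated caption nearly aligned with the image

print(angular_triplet_loss(image, positive, well_separated))  # no penalty
print(angular_triplet_loss(image, positive, too_close) > 0)   # penalized
```

A loss of this shape only acts on triplets that violate the margin, which is one way an objective could separate negated descriptions from positives without disturbing the rest of a pretrained embedding space.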
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR) |
| Cite as: | arXiv:2606.23843 [cs.CV] (or arXiv:2606.23843v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.23843 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Hoang-Bao Le
[v1] Mon, 22 Jun 2026 18:25:37 UTC (16,150 KB)
— Originally published at arxiv.org