RSGPNet: Geometric Prompting for Remote Sensing Open-Vocabulary Semantic Segmentation
Quick Answer
RSGPNet introduces a training-free geometric prompting framework for open-vocabulary semantic segmentation in remote sensing, significantly enhancing segmentation accuracy through a novel combination of text-guided coarse masks, geometric re-prompting, and consistency verification.
Quick Take
The pipeline runs in three stages: a text-guided coarse mask, geometric re-prompting with box prompts, and a consistency check between the coarse and refined results. In the paper's experiments on remote sensing datasets, RSGPNet outperforms existing methods in both quantitative results and qualitative comparisons.
Key Points
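- RSGPNet consists of three core modules: the Text-guided Coarse Mask module (TCM), the Geometric Re-prompting Module (GRP), and the Coarse-to-fine Consistency Verification Mechanism (CVM).
- TCM generates initial coarse segmentation masks from text prompts and the input image.
- GRP converts these coarse masks into geometric box prompts and feeds them back into the segmentation model to produce refined masks.
- CVM applies a consistency computation so that re-prompting does not reinforce erroneous regions.
- In the reported experiments, RSGPNet outperforms state-of-the-art methods on remote sensing datasets.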
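The mask-to-box step that GRP relies on can be illustrated with a minimal sketch. This is not the authors' implementation: the list-of-lists mask representation, the `mask_to_box` name, and the choice of one tight box around all foreground pixels are assumptions made for illustration. The paper may, for example, produce one box per connected region.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]            # binary mask as rows of 0/1 (assumed representation)
Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixel indices


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tight bounding box around all foreground pixels, or None if the mask is empty."""
    # Column indices of every foreground pixel.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    # Row indices of every row that contains at least one foreground pixel.
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# A toy coarse mask standing in for the output of the text-guided stage.
coarse_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(coarse_mask)
print(box)  # (1, 1, 3, 2)
```

In a full pipeline, the resulting box would be passed to a promptable segmentation model as a geometric prompt. That call is omitted here because the source does not specify which model or API is used.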
Abstract: Open-vocabulary semantic segmentation (OVSS) enables text-guided segmentation of unseen objects, breaking fixed-class limitations to achieve open-world understanding. However, existing OVSS methods primarily focus on modifying the CLIP attention mechanism, which still suffers from unstable local segmentation in the remote sensing (RS) domain. To address these limitations, we propose RSGPNet, a training-free geometric prompting framework for RS OVSS that refines segmentation by leveraging object geometric areas and consistency constraints. Specifically, RSGPNet comprises three core modules: a Text-guided Coarse Mask module (TCM), a Geometric Re-prompting Module (GRP), and a Coarse-to-fine Consistency Verification Mechanism (CVM). TCM utilizes text prompts and the input image to construct initial coarse segmentation masks. GRP then converts these coarse masks into geometric box prompts, feeding them back into the segmentation model to generate refined masks. Finally, CVM employs consistency computation to prevent prompting from reinforcing erroneous regions. Together, these modules allow the model to improve segmentation accuracy in complex areas, such as category boundaries. Extensive experiments on RS datasets demonstrate that RSGPNet significantly outperforms state-of-the-art methods in both quantitative and qualitative comparisons while exhibiting excellent interpretability. The code is released at this https URL.
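The abstract does not say how CVM computes consistency. As a hedged illustration only, the sketch below assumes intersection-over-union (IoU) between the coarse and refined masks, with a fallback to the coarse mask when agreement is low. The `iou` and `verify` names, the 0.5 threshold, and the fallback rule are all hypothetical and are not taken from the paper.

```python
from typing import List

Mask = List[List[int]]  # binary mask as rows of 0/1 (assumed representation)


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as perfectly consistent.
    return inter / union if union else 1.0


def verify(coarse: Mask, refined: Mask, threshold: float = 0.5) -> Mask:
    """Keep the refined mask only if it agrees with the coarse mask (assumed rule)."""
    return refined if iou(coarse, refined) >= threshold else coarse


coarse = [[0, 1, 1, 0],
          [0, 1, 1, 0]]
refined_good = [[0, 1, 1, 1],
                [0, 1, 1, 0]]   # overlaps the coarse mask: IoU = 4/5
refined_bad = [[1, 0, 0, 0],
               [1, 0, 0, 0]]    # drifted to a different region: IoU = 0

print(verify(coarse, refined_good) is refined_good)  # True
print(verify(coarse, refined_bad) is coarse)         # True
```

The point of a check like this is that a box prompt drawn around a wrong coarse region can yield a confident but wrong refined mask. Comparing the two stages gives a cheap signal for rejecting such refinements.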
| Comments: | Open-vocabulary, Remote sensing, Geometric prompting, Multimodal large language model |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2606.28410 [cs.CV] (or arXiv:2606.28410v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.28410 (arXiv-issued DOI via DataCite) |
Submission history
From: Shanwen Wang
[v1] Thu, 25 Jun 2026 03:52:43 UTC (12,226 KB)
— Originally published at arxiv.org