RSGPNet: Geometric Prompting for Remote Sensing Open-Vocabulary Semantic Segmentation

arXiv cs.CV·Shanwen Wang, Xin Sun, Sirui Wang, Xiao Xiang Zhu


6/30/2026

Quick Answer

RSGPNet introduces a training-free geometric prompting framework for open-vocabulary semantic segmentation in remote sensing, significantly enhancing segmentation accuracy through a novel combination of text-guided coarse masks, geometric re-prompting, and consistency verification.

Quick Take

Key Points

RSGPNet consists of three core modules: TCM, GRP, and CVM.
TCM generates initial coarse segmentation masks using text prompts and images.
GRP refines these masks by converting them into geometric box prompts.
CVM checks coarse-to-fine consistency so that re-prompting does not reinforce erroneous regions.
RSGPNet outperforms state-of-the-art methods on remote sensing datasets.
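
The mask-to-box step attributed to GRP can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_to_box`, the single-box simplification, and the (x_min, y_min, x_max, y_max) convention are assumptions made here for illustration only.

```python
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Return the tight box (x_min, y_min, x_max, y_max) around all
    foreground pixels of a binary mask, or None if the mask is empty.

    Hypothetical helper: the paper's GRP module may derive boxes differently
    (for example, one box per connected region)."""
    ys, xs = np.nonzero(mask)  # row and column indices of foreground pixels
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy coarse mask: a 3x4 foreground block inside a 6x8 image.
coarse = np.zeros((6, 8), dtype=bool)
coarse[2:5, 3:7] = True
box = mask_to_box(coarse)
print(box)
```

In a pipeline of this kind, the resulting box would be passed as a prompt to a promptable segmentation model, which returns the refined mask.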

[Submitted on 25 Jun 2026]

Abstract: Open-vocabulary semantic segmentation (OVSS) enables text-guided segmentation of unseen objects, breaking fixed-class limitations to achieve open-world understanding. However, existing OVSS methods primarily focus on modifying the CLIP attention mechanism, an approach that still suffers from unstable local segmentation in the remote sensing (RS) domain. To address this limitation, we propose RSGPNet, a training-free geometric prompting framework for RS OVSS that refines segmentation by leveraging object geometric areas and consistency constraints. Specifically, RSGPNet comprises three core modules: a Text-guided Coarse Mask module (TCM), a Geometric Re-prompting Module (GRP), and a Coarse-to-fine Consistency Verification Mechanism (CVM). TCM utilizes text prompts and the input image to construct initial coarse segmentation masks. GRP then converts these coarse masks into geometric box prompts, feeding them back into the segmentation model to generate refined masks. Finally, CVM employs consistency computation to prevent prompting from reinforcing erroneous regions. Together, these modules allow the model to improve segmentation accuracy in complex areas, such as category boundaries. Extensive experiments on RS datasets demonstrate that RSGPNet significantly outperforms state-of-the-art methods in both quantitative and qualitative evaluations while exhibiting excellent interpretability. The code is released at this https URL.
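
The abstract does not specify how CVM computes consistency. One plausible reading, sketched below purely as an assumption, is to compare the coarse and refined masks with intersection-over-union (IoU) and fall back to the coarse mask when agreement is low. The names `mask_iou` and `verify_refinement`, the IoU measure, and the 0.5 threshold are all hypothetical.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are treated as fully consistent
    return float(np.logical_and(a, b).sum() / union)

def verify_refinement(coarse, refined, threshold=0.5):
    """Accept the refined mask only if it agrees with the coarse mask.

    Assumed rule, not taken from the paper: below the threshold the
    refinement is treated as having drifted, and the coarse mask is kept."""
    return refined if mask_iou(coarse, refined) >= threshold else coarse

coarse = np.zeros((6, 8), dtype=bool)
coarse[2:5, 3:7] = True            # 12 foreground pixels

good = np.zeros_like(coarse)
good[2:5, 3:6] = True              # 9 pixels, all inside the coarse mask

drifted = np.zeros_like(coarse)
drifted[0:2, 0:2] = True           # no overlap with the coarse mask

print(mask_iou(coarse, good))      # 9 / 12 = 0.75
print(verify_refinement(coarse, drifted) is coarse)
```

A check of this shape is what keeps a box prompt drawn around a wrong coarse region from locking in that error: a refined mask that disagrees strongly with its source is discarded rather than trusted.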

Comments:	Open-vocabulary, Remote sensing, Geometric prompting, Multimodal large language model
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.28410 [cs.CV]
	(or arXiv:2606.28410v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2606.28410 arXiv-issued DOI via DataCite

Submission history

From: Shanwen Wang [view email]
[v1] Thu, 25 Jun 2026 03:52:43 UTC (12,226 KB)

— Originally published at arxiv.org
