How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

arXiv cs.CV·Julia Romero, Qin Lv, Morteza Karimzadeh

6/15/2026

Quick Take

Self-supervised geospatial foundation models (GeoFMs) transfer unevenly across classification, regression, and segmentation tasks. Layerwise probing shows that, in most cases, intermediate transformer blocks expose more task-relevant information than final-layer embeddings. Adaptation settings such as decoder design and fine-tuning can matter as much as the choice of GeoFM, which motivates representation-aware evaluation and adaptation strategies.

Key Points

Six GeoFMs are evaluated, spanning joint-embedding, reconstruction, and multimodal pretraining families.
Model rankings change across downstream tasks and adaptation settings.
In most cases, task-relevant information is more accessible in intermediate transformer blocks than in final-layer embeddings.
In segmentation case studies on PASTIS and Sen1Floods11, decoder design and fine-tuning can be as impactful as the choice of GeoFM.
CKA analysis shows that fine-tuning does not change GeoFMs uniformly across depth; the strongest changes are localized to the first linear layer of the MLP in ViT blocks.

Article Content

From source RSS / original summary

arXiv:2606.13896v1

Abstract: Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines.

We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles.
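To make the idea of layerwise probing concrete, here is a minimal Python sketch, not the authors' protocol. It assumes per-block embeddings have already been extracted from a frozen encoder and fits a closed-form ridge probe on each one. The function names, the ridge probe itself, and the synthetic "layers" are illustrative assumptions; the summary does not say which probe the paper uses.

```python
import numpy as np


def fit_ridge_probe(x, y, num_classes, l2=1e-2):
    """Fit a linear probe in closed form: ridge regression onto one-hot labels."""
    x = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    targets = np.eye(num_classes)[y]
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def probe_accuracy(weights, x, y):
    """Held-out accuracy of a fitted probe (argmax over class scores)."""
    x = np.hstack([x, np.ones((x.shape[0], 1))])
    return float(np.mean(np.argmax(x @ weights, axis=1) == y))


def probe_all_layers(train_feats, train_y, test_feats, test_y, num_classes):
    """Return one held-out accuracy per layer; only the probe is trained."""
    return [
        probe_accuracy(fit_ridge_probe(tr, train_y, num_classes), te, test_y)
        for tr, te in zip(train_feats, test_feats)
    ]


def synthetic_layer_features(rng, labels, class_means, signal_scales):
    """Toy stand-in for per-block embeddings: class signal scaled per 'layer'."""
    dim = class_means.shape[1]
    return [
        scale * class_means[labels] + rng.normal(size=(labels.size, dim))
        for scale in signal_scales
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_classes, dim = 3, 16
    class_means = rng.normal(size=(num_classes, dim))
    train_y = rng.integers(0, num_classes, size=600)
    test_y = rng.integers(0, num_classes, size=300)
    # Made-up scales: the middle "layer" is given the strongest class signal.
    scales = [0.1, 3.0, 0.3]
    train_feats = synthetic_layer_features(rng, train_y, class_means, scales)
    test_feats = synthetic_layer_features(rng, test_y, class_means, scales)
    accs = probe_all_layers(train_feats, train_y, test_feats, test_y, num_classes)
    for depth, acc in enumerate(accs):
        print(f"layer {depth}: probe accuracy {acc:.3f}")
```

With real models, the feature lists would hold pooled activations from each transformer block, and the resulting accuracy-versus-depth curve is one way to read off a depthwise profile.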

In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks.
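The CKA comparison mentioned above can be sketched as follows. This is a minimal Python example assuming the linear variant of CKA, which the summary does not confirm, and assuming activations have been collected on the same inputs before and after fine-tuning. The layer names used as dictionary keys are hypothetical placeholders.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_per_layer(pretrained_acts, finetuned_acts):
    """Compare each layer with itself before and after fine-tuning.

    Values near 1 mean the representation barely moved; lower values
    mean fine-tuning changed that layer more.
    """
    return {
        name: linear_cka(pretrained_acts[name], finetuned_acts[name])
        for name in pretrained_acts
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical layer names and random activations, for illustration only.
    pretrained = {
        "attn.proj": rng.normal(size=(200, 8)),
        "mlp.fc1": rng.normal(size=(200, 8)),
    }
    finetuned = {
        "attn.proj": pretrained["attn.proj"].copy(),  # unchanged layer
        "mlp.fc1": rng.normal(size=(200, 8)),         # heavily changed layer
    }
    for name, score in cka_per_layer(pretrained, finetuned).items():
        print(f"{name}: CKA = {score:.3f}")
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so it compares representations by their geometry and ignores rotations or rescalings of individual layers.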

These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.
