How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?
Quick Answer
The paper finds that self-supervised geospatial foundation models (GeoFMs) transfer unevenly across classification, regression, and segmentation tasks. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks than in final-layer embeddings.
Quick Take
GeoFM rankings change across downstream tasks and adaptation settings, and each model shows its own depthwise profile of where task-relevant information is accessible. In the segmentation case studies, adaptation choices such as decoder design and fine-tuning can matter as much as the choice of GeoFM, which motivates more representation-aware evaluation and adaptation strategies.
Key Points
- Six representative GeoFMs are evaluated, spanning joint-embedding, reconstruction, and multimodal pretraining families.
- Model rankings change across downstream tasks (classification, regression, segmentation) and adaptation settings.
- In most cases, task-relevant information is more accessible in intermediate transformer blocks than in final-layer embeddings.
- In segmentation case studies on PASTIS and Sen1Floods11, decoder design and fine-tuning can be as impactful as the choice of GeoFM.
- CKA analysis shows that fine-tuning does not change GeoFMs uniformly across depth; the strongest changes are localized to the first linear layer of the MLP in ViT blocks.
Article Content
From source RSS / original summary
arXiv:2606.13896v1
Abstract: Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines.
We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles.
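As a rough illustration of the layerwise probing mentioned above, the sketch below fits a separate linear probe on the features from each block and compares held-out accuracy by depth. It runs on synthetic data and is not the paper's protocol: the function names (`fit_linear_probe`, `layerwise_probe`, `make_synthetic_layers`), the ridge-regression probe, and the per-block signal strengths are all assumptions made for illustration.

```python
import numpy as np


def fit_linear_probe(train_x, train_y, num_classes, l2=1e-2):
    """Closed-form ridge regression onto one-hot labels (a simple linear probe)."""
    x = np.hstack([train_x, np.ones((train_x.shape[0], 1))])  # append bias column
    targets = np.eye(num_classes)[train_y]                    # one-hot targets
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def probe_accuracy(weights, test_x, test_y):
    """Accuracy of the probe when predicting the arg-max class score."""
    x = np.hstack([test_x, np.ones((test_x.shape[0], 1))])
    return float(np.mean(np.argmax(x @ weights, axis=1) == test_y))


def layerwise_probe(train_feats, train_y, test_feats, test_y, num_classes):
    """train_feats / test_feats: one (n_samples, dim) array per block."""
    return [
        probe_accuracy(fit_linear_probe(tr, train_y, num_classes), te, test_y)
        for tr, te in zip(train_feats, test_feats)
    ]


def make_synthetic_layers(signal_per_block, n_train=400, n_test=200,
                          dim=16, num_classes=3, seed=0):
    """Hypothetical stand-in for frozen per-block features of a real model.

    Each block gets class-dependent means scaled by its signal strength,
    plus unit-variance Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, num_classes, size=n_train + n_test)
    train_feats, test_feats = [], []
    for signal in signal_per_block:
        class_means = signal * rng.normal(size=(num_classes, dim))
        feats = class_means[labels] + rng.normal(size=(n_train + n_test, dim))
        train_feats.append(feats[:n_train])
        test_feats.append(feats[n_train:])
    return train_feats, labels[:n_train], test_feats, labels[n_train:]


if __name__ == "__main__":
    # Assumed profile: the third block carries the strongest class signal.
    data = make_synthetic_layers([0.0, 1.0, 3.0, 1.0])
    for block, acc in enumerate(layerwise_probe(*data, num_classes=3)):
        print(f"block {block}: probe accuracy = {acc:.3f}")
```

With a real GeoFM, the synthetic arrays would be replaced by pooled token features extracted from each frozen transformer block; the probe and the per-block comparison stay the same.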
In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks.
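To make the CKA comparison above more concrete, here is a minimal sketch of linear CKA applied block by block to activations before and after fine-tuning. The abstract does not say which CKA variant the authors use, so the linear form is an assumption, as are the function names (`linear_cka`, `cka_by_block`) and the synthetic activations in the demo.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Both matrices must come from the same inputs in the same order.
    Returns a value in [0, 1]; 1 means the representations match up to
    an orthogonal transform and isotropic scaling.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


def cka_by_block(pre_acts, post_acts):
    """Compare pretrained and fine-tuned activations at each block.

    pre_acts / post_acts: one (n_samples, dim) array per block.
    A low score marks a block whose representation changed a lot.
    """
    return [linear_cka(pre, post) for pre, post in zip(pre_acts, post_acts)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical activations: fine-tuning perturbs the middle block most.
    pre = [rng.normal(size=(200, 8)) for _ in range(3)]
    noise_scales = [0.05, 2.0, 0.05]
    post = [p + s * rng.normal(size=p.shape) for p, s in zip(pre, noise_scales)]
    for block, score in enumerate(cka_by_block(pre, post)):
        print(f"block {block}: CKA(pretrained, fine-tuned) = {score:.3f}")
```

The same function can be applied at finer granularity, for example to the outputs of individual sublayers inside a ViT block, which is the level at which the abstract reports the strongest changes.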
These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.