EarthShift: a benchmark for measuring robustness to real-world distribution shifts in Earth observation

arXiv cs.CV·Kelsey Doerksen, Hannah Kerner

5/29/2026

Quick Take

EarthShift is the first public benchmark for assessing how robust Earth observation models are to real-world distribution shifts. Tests on 8 geospatial foundation models show performance that is 15-20% worse on average in out-of-distribution scenarios, underscoring the need for future research to improve distributional robustness and not just performance. The code and datasets are publicly available.

Key Points

EarthShift benchmarks robustness across multiple realistic distribution shifts in remote sensing, such as new time periods, geographies, scales, and sensors.
8 geospatial foundation models, tested on 11 tasks covering 5 shift types, perform 15-20% worse out-of-distribution on average.
GFM robustness is similar to that of generic vision foundation models and even fully-supervised models.
The benchmark aims to guide future research towards reliable real-world applications.
Code and datasets are available at https://earthshift.github.io.

Article Content

From source RSS / original summary

arXiv:2605.29330v1. Abstract: Current Earth observation benchmarks focus on measuring performance on diverse tasks and applications, typically measuring generalization in-distribution. But when models are deployed, they must generalize to myriad out-of-distribution scenarios, such as new time periods, geographies, scales, and sensors. We introduce EarthShift: the first public testbed for benchmarking robustness across multiple realistic distribution shifts encountered in remote sensing.

EarthShift enables users to measure distributional robustness by comparing performance in- and out-of-distribution using paired datasets from different sources, temporal windows, geographic locations, and sensors. Our experiments on 8 geospatial foundation models (GFMs) and 11 tasks covering 5 shift types show that GFMs consistently perform 15-20% worse out-of-distribution on average, regardless of model architecture, size, or pre-training or fine-tuning strategy.
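The comparison described above can be illustrated with a little arithmetic. The sketch below is not EarthShift's code or API: the `ShiftResult` record, the function names, the task labels, and the scores are all invented for illustration. The summary also does not say whether "15-20% worse" is a relative drop or an absolute gap in metric points, so both are computed here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ShiftResult:
    """Scores for one model on one task (hypothetical record, not EarthShift's API)."""
    task: str
    id_score: float   # metric on the in-distribution test set, higher is better
    ood_score: float  # same metric on the paired out-of-distribution test set


def absolute_gap(r: ShiftResult) -> float:
    """In-distribution score minus out-of-distribution score, in metric points."""
    return r.id_score - r.ood_score


def relative_drop(r: ShiftResult) -> float:
    """Fraction of in-distribution performance lost out-of-distribution."""
    if r.id_score <= 0:
        raise ValueError("id_score must be positive to compute a relative drop")
    return (r.id_score - r.ood_score) / r.id_score


def mean_relative_drop(results: list[ShiftResult]) -> float:
    """Average relative drop across tasks (an assumed aggregation rule)."""
    return mean(relative_drop(r) for r in results)


# Made-up scores, chosen only to show the calculation.
results = [
    ShiftResult("temporal-shift-task", id_score=0.80, ood_score=0.64),
    ShiftResult("geographic-shift-task", id_score=0.90, ood_score=0.72),
]

for r in results:
    print(f"{r.task}: gap={absolute_gap(r):.2f}, drop={relative_drop(r):.1%}")
print(f"mean relative drop: {mean_relative_drop(results):.1%}")
```

Reporting both numbers side by side avoids ambiguity: a model that falls from 0.80 to 0.64 loses 0.16 metric points but 20% of its in-distribution performance.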

We show that GFM robustness is similar to that of generic vision foundation models, and even fully-supervised models. This highlights a need for future research to strive for improvements in distributional robustness, not just performance, which can be benchmarked using EarthShift. We release our code and datasets to provide a testbed to guide future work to create foundation models that are robust and reliable in real-world applications. Code and data for EarthShift are available at: https://earthshift.github.io
