Pinpoint: Grounded Worldwide Image Geolocation via Cross-Source Retrieval and Reranking
Quick Take
Pinpoint is a retrieve-and-rerank architecture that combines internet photos and street-view imagery in a single coarse-to-fine geolocation pipeline. According to the abstract, it achieves state-of-the-art results on IM2GPS3k, YFCC4k, and OSV-5M without relying on multimodal large language models, which makes inference faster and more reproducible.
Key Points
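- Pinpoint combines user-uploaded Flickr photos with street-view imagery, which prior work has typically treated as separate tasks.
- A contrastive image-GPS embedder learns a shared image-GPS embedding space, which is used to retrieve candidate locations.
- An attention-based reranker rescores the retrieved candidates using candidate-level visual and GPS features plus cross-source evidence from nearby locations.
- The abstract reports state-of-the-art results across all metrics on IM2GPS3k and YFCC4k (internet photos) and OSV-5M (street-view imagery).
- Because it does not rely on multimodal large language models, inference is faster and more reproducible than in recent prior work.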
Article Content
From source RSS / original summary: arXiv:2606.04133v1
Abstract: Image geolocation aims to estimate where a photograph was taken from its visual content. At worldwide scale, this remains challenging because visual evidence is often ambiguous, diverse, and unevenly distributed.
Prior work has typically treated geolocation of ordinary internet photos and street-view imagery as separate tasks, despite their complementary strengths: internet photos better match the appearance distribution of user-captured queries, while street-view imagery provides denser, geographically grounded coverage. We present Pinpoint, a retrieve-and-rerank architecture that combines both sources in a coarse-to-fine pipeline.
A contrastive image-GPS embedder is trained on both user-uploaded Flickr photos and street-view imagery, learning a shared image-GPS embedding space that is used to retrieve candidate locations. An attention-based reranker then rescores retrieved candidates by combining candidate-level visual and GPS features with cross-source evidence from nearby locations to ground the prediction. Unlike recent prior work, Pinpoint does not rely on multimodal large-language models, making inference faster and more reproducible.
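The two stages described above can be illustrated with a toy sketch. It is not the paper's implementation: the embeddings are hand-picked 3-D vectors rather than outputs of a trained contrastive embedder, and the learned attention-based reranker is replaced by a simple hand-written heuristic that boosts a candidate when a candidate from the other source lies nearby. All names (`retrieve`, `rerank`, `radius_km`, `alpha`) and the gallery entries are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))

def retrieve(query_emb, gallery, k):
    """Coarse stage: top-k gallery entries by similarity in the shared space."""
    scored = [dict(item, score=cosine(query_emb, item["emb"])) for item in gallery]
    return sorted(scored, key=lambda c: c["score"], reverse=True)[:k]

def rerank(candidates, radius_km=25.0, alpha=0.5):
    """Fine stage (heuristic stand-in for a learned reranker): add a bonus
    for nearby candidates that come from the *other* image source."""
    rescored = []
    for c in candidates:
        support = sum(
            o["score"] for o in candidates
            if o["source"] != c["source"]
            and haversine_km(o["gps"], c["gps"]) <= radius_km
        )
        rescored.append(dict(c, score=c["score"] + alpha * support))
    return sorted(rescored, key=lambda c: c["score"], reverse=True)

# Hypothetical gallery mixing both sources (assumed data, for illustration only).
gallery = [
    {"emb": [1.0, 0.0, 0.0], "source": "flickr", "gps": (48.8584, 2.2945)},
    {"emb": [0.9, 0.1, 0.0], "source": "streetview", "gps": (48.8606, 2.3376)},
    {"emb": [1.0, 0.05, 0.0], "source": "flickr", "gps": (40.6892, -74.0445)},
    {"emb": [0.0, 1.0, 0.0], "source": "streetview", "gps": (35.6586, 139.7454)},
]
query = [1.0, 0.05, 0.0]

candidates = retrieve(query, gallery, k=3)
prediction = rerank(candidates)[0]["gps"]
```

In this toy data the single most similar gallery item is an isolated Flickr photo, but the reranking step prefers a location where a Flickr candidate and a street-view candidate agree. That is the intuition behind grounding a prediction with cross-source evidence; the actual model learns this rescoring with attention rather than a fixed rule.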
Pinpoint achieves state-of-the-art results across all metrics on standard benchmarks for internet photos (IM2GPS3k and YFCC4k) and street-view imagery (OSV-5M).
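The summary does not list which metrics are reported. Geolocation benchmarks such as IM2GPS3k are commonly scored as the fraction of predictions that fall within fixed distance thresholds of the ground truth, so the sketch below assumes that convention. The thresholds (1, 25, 200, 750, 2500 km) and the function names are assumptions, not details taken from the paper.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))

def accuracy_at_thresholds(preds, truths, thresholds_km=(1, 25, 200, 750, 2500)):
    """Fraction of predictions within each distance threshold of the truth."""
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and equal length")
    errors = [haversine_km(p, t) for p, t in zip(preds, truths)]
    return {
        km: sum(e <= km for e in errors) / len(errors)
        for km in thresholds_km
    }

# Two predictions at Paris; the second ground truth is London (made-up example).
preds = [(48.8566, 2.3522), (48.8566, 2.3522)]
truths = [(48.8566, 2.3522), (51.5074, -0.1278)]
scores = accuracy_at_thresholds(preds, truths)
```

Here the first prediction is exact and the second is off by roughly the Paris to London distance, so it counts as correct only at the coarser thresholds.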