Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

arXiv cs.CV·Niels Sombekke, Rob G. J. Wijnhoven, Martin R. Oswald

5/27/2026

Quick Take

The study introduces a multi-modal classification framework that uses Perceiver IO to fuse satellite and street-level imagery for building inspection, jointly predicting roof element and roof material classes. The authors built a dataset of 32,135 buildings (61,672 segments) across ten countries and propose an RGB-M masking strategy that outperforms hard cropping. Fusion yields per-class gains for attributes visible from street level, such as +11.3 AP for slate, although the satellite-only baseline keeps a slight edge in macro-averaged mAP. The architecture supports heterogeneous inputs and multiple output tasks.

Key Points

Perceiver IO architecture fuses satellite and street-level imagery for building inspection.
Dataset includes 32,135 buildings with up to eight street views per segment.
RGB-M masking appends the building footprint mask as a fourth input channel and outperforms hard cropping across both modalities.
Fusion adds +11.3 AP for slate and +1.3 AP for dormers, both attributes visible from street level.
Architecture accommodates heterogeneous inputs and multiple output tasks.

Article Content

From source RSS / original summary

arXiv:2605.26381v1. Abstract: We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design naturally handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes.
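
To make the variable-view property concrete, here is a minimal NumPy sketch of the cross-attention read step that Perceiver-style models use: a fixed set of latent vectors queries whatever tokens are available. This is not the authors' implementation. The function names, the single attention head, the dimensions, and the random features standing in for DINOv2 patch tokens are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: the latents query the input tokens."""
    q = latents @ w_q                          # (num_latents, d)
    k = tokens @ w_k                           # (num_tokens, d)
    v = tokens @ w_v                           # (num_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (num_latents, num_tokens)
    return softmax(scores, axis=-1) @ v        # (num_latents, d)

def fuse(satellite_tokens, street_view_tokens, latents, weights):
    """Concatenate tokens from every available view, then read them into the latents."""
    tokens = np.concatenate([satellite_tokens, *street_view_tokens], axis=0)
    return cross_attend(latents, tokens, *weights)

# Hypothetical sizes; random features stand in for backbone patch tokens.
rng = np.random.default_rng(0)
D, NUM_LATENTS, PATCHES = 32, 8, 16
latents = rng.normal(size=(NUM_LATENTS, D))
weights = tuple(rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
satellite = rng.normal(size=(PATCHES, D))

fused_2 = fuse(satellite, [rng.normal(size=(PATCHES, D)) for _ in range(2)], latents, weights)
fused_8 = fuse(satellite, [rng.normal(size=(PATCHES, D)) for _ in range(8)], latents, weights)
print(fused_2.shape, fused_8.shape)  # both (8, 32)
```

Because the softmax runs over the token axis, the output shape depends only on the number of latents, so a building with two street-level views and one with eight go through the same code path with no padding.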

We construct a large-scale dataset of 32,135 buildings (61,672 segments) spanning ten countries, pairing satellite images with up to eight street-level views per segment and evaluating four masking strategies for isolating the target building. We propose an RGB-M masking strategy that appends the building footprint mask as a fourth input channel, providing a soft spatial prior that outperforms hard cropping across both modalities.
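
The difference between a mask channel and hard masking can be sketched in a few lines of NumPy. The helper names `rgb_m` and `hard_mask` are hypothetical, and zeroing pixels outside the footprint is only an assumed stand-in for the hard cropping baseline, whose exact form the summary does not give. How the fourth channel is fed into the DINOv2 backbone is also not described here.

```python
import numpy as np

def rgb_m(image, mask):
    """Append a binary footprint mask as a fourth channel: (H, W, 3) -> (H, W, 4)."""
    if image.shape[:2] != mask.shape:
        raise ValueError("image and mask must share height and width")
    mask_channel = mask.astype(image.dtype)[..., None]
    return np.concatenate([image, mask_channel], axis=-1)

def hard_mask(image, mask):
    """Assumed baseline for contrast: zero every pixel outside the footprint."""
    return image * mask.astype(image.dtype)[..., None]

# Toy 4x4 image with a 2x2 footprint in the centre.
image = np.random.default_rng(0).random((4, 4, 3)).astype(np.float32)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

soft = rgb_m(image, mask)      # all RGB context kept, mask rides along
hard = hard_mask(image, mask)  # context outside the footprint is destroyed
print(soft.shape, hard.shape)  # (4, 4, 4) (4, 4, 3)
```

The RGB-M variant leaves every pixel intact and lets the model decide how much to rely on the surroundings, which is one plausible reading of the "soft spatial prior" described above.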

The Perceiver IO fusion model improves over all other fusion strategies and yields substantial per-class gains for attributes visible from street level (e.g., +11.3 AP for slate, +1.3 AP for dormers), though the satellite-only baseline retains a slight advantage in macro-averaged mAP for classes that are predominantly visible from above. These results establish a scalable, flexible architecture for multi-modal building inspection that can accommodate heterogeneous inputs and multiple output tasks.
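
For readers unfamiliar with the metrics, the following sketch computes per-class average precision (AP) and its macro average (mAP) in plain Python. It uses the common non-interpolated definition, the mean of the precision values at each positive's rank; the paper's exact evaluation protocol is not stated in the summary, so treat this as an assumption. The labels and scores are made-up toy values, not results from the paper.

```python
def average_precision(labels, scores):
    """Mean of precision@k taken at the rank k of every positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def macro_map(per_class):
    """Per-class AP plus the unweighted mean across classes (macro mAP)."""
    aps = {name: average_precision(labels, scores)
           for name, (labels, scores) in per_class.items()}
    return aps, sum(aps.values()) / len(aps)

# Toy predictions for two hypothetical classes (illustrative values only).
toy = {
    "slate":  ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "dormer": ([0, 1, 1, 0], [0.2, 0.9, 0.8, 0.4]),
}
aps, macro = macro_map(toy)
print(aps, macro)
```

Since macro mAP weights every class equally, a large gain on one class such as slate can coexist with a slightly lower average when other classes lose a little, which matches the pattern reported above.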

More from arXiv cs.CV

arXiv cs.CV·Taha Koleilat, Hassan Rivaz, Yiming Xiao

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

AI Summary

Evi-Steer introduces an evidential tuning framework for BiomedCLIP that updates only 0.11% of parameters while enabling uncertainty-aware fine-tuning. It outperforms state-of-the-art methods across 15 biomedical imaging datasets and remains effective under few-shot learning and domain shift in clinical applications.

Deep Learning-Based Automated Quantification of TIMI Myocardial Perfusion Frame Count (DL-TMPFC) from Coronary Angiography: A Novel Framework for Rapid Assessment of Microvascular Dysfunction

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

Related in this space

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

TorqueAGI Announces Collaborations with NVIDIA, John Deere, and Dexterity to Advance Physical AI for Enterprise-Grade Robots

FORT Robotics Acquires Mapless AI to Expand Its Trust Platform with Remote Supervision and Active Safety Capabilities