Joint Instance Segmentation and Geometric Attribute Regression for Roof Structures in Aerial Imagery
Quick Take
This study introduces a modified Mask R-CNN that jointly segments roof instances and regresses their geometric attributes from a single aerial orthophoto. Trained and evaluated on a large-scale dataset of Dutch aerial images, it achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP$_{50}$ of 0.566.
Key Points
- Method extends Mask R-CNN with an attribute regression branch.
- Conditional azimuth loss suppresses supervision for flat roof segments.
- Log-normalized height representation addresses skewed building height distribution.
- Achieves 0.566 AP$_{50}$ for instance segmentation on Dutch aerial images.
- Enables simplified 3D building model reconstruction from a single image.
Article Content
From source RSS / original summary: arXiv:2605.26370v1. Announce Type: new. Abstract: We present a method for jointly predicting instance-level roof segment masks together with three continuous geometric attributes -- building height, roof slope, and roof azimuth -- from a single aerial orthophoto.
Our approach extends Mask R-CNN with a dedicated attribute regression branch and introduces two key innovations: a conditional azimuth loss that suppresses supervision for flat roof segments where azimuth labels are inherently noisy, and a log-normalized height representation that addresses the heavily skewed distribution of building heights. We train and evaluate on a large-scale dataset of Dutch aerial images paired with automatically derived ground truth from 3DBAG, a nationwide LiDAR-based 3D building dataset.
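The conditional azimuth loss described above can be illustrated with a minimal sketch. The abstract does not say how azimuth is encoded, which loss function is used, or how "flat" is defined, so everything here is an assumption: the (sin, cos) encoding, the squared-error loss, the 5-degree slope threshold `FLAT_SLOPE_DEG`, and the function names are all hypothetical. The only idea taken from the source is that flat segments contribute no azimuth supervision.

```python
import math

# Hypothetical threshold: the abstract does not state how "flat" is defined.
FLAT_SLOPE_DEG = 5.0


def encode_azimuth(azimuth_deg):
    """Map an azimuth in degrees to (sin, cos) so that 359 and 1 degree are close.

    This encoding is an assumption, not something the abstract specifies.
    """
    rad = math.radians(azimuth_deg)
    return (math.sin(rad), math.cos(rad))


def conditional_azimuth_loss(pred, target_azimuth_deg, target_slope_deg,
                             flat_threshold_deg=FLAT_SLOPE_DEG):
    """Squared error on (sin, cos), averaged over non-flat segments only.

    pred: list of (sin, cos) predictions, one per roof segment.
    Returns 0.0 when every segment in the batch is flat.
    """
    total, count = 0.0, 0
    for (p_sin, p_cos), az, slope in zip(pred, target_azimuth_deg, target_slope_deg):
        if slope < flat_threshold_deg:
            continue  # flat segment: azimuth label is unreliable, so skip it
        t_sin, t_cos = encode_azimuth(az)
        total += (p_sin - t_sin) ** 2 + (p_cos - t_cos) ** 2
        count += 1
    return total / count if count else 0.0


# Segment 1 is pitched and predicted correctly; segment 2 is flat, so its
# (wrong) azimuth prediction is ignored by the loss.
pred = [encode_azimuth(90.0), encode_azimuth(0.0)]
loss_with_flat = conditional_azimuth_loss(pred, [90.0, 180.0], [30.0, 1.0])

# Same predictions, but segment 2 is now pitched, so its error is counted.
loss_all_pitched = conditional_azimuth_loss(pred, [90.0, 180.0], [30.0, 30.0])
```

In this sketch the flat segment's prediction is off by 180 degrees yet adds nothing to the loss, which is the suppression behaviour the abstract describes. In a real training loop the same masking would be applied to tensors rather than Python lists.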
Using a DINOv3 ConvNeXt-Base backbone, our method achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP$_{50}$ of 0.566. The predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.
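Since the height error is reported in meters while the regression target is log-normalized, predictions presumably have to be mapped back to meters before evaluation. The sketch below shows one way such a representation could work. The exact transform is not given in the abstract, so the use of `log1p`, the standardization by the mean and standard deviation of the log heights, and all function names are assumptions made for illustration.

```python
import math


def fit_log_stats(heights_m):
    """Mean and standard deviation of log(1 + h) over training heights in meters."""
    logs = [math.log1p(h) for h in heights_m]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)


def encode_height(height_m, mean, std):
    """Compress the long tail with log1p, then standardize (assumed scheme)."""
    return (math.log1p(height_m) - mean) / std


def decode_height(z, mean, std):
    """Invert encode_height to recover a height in meters."""
    return math.expm1(z * std + mean)


def mean_absolute_error(pred, target):
    """Plain MAE, computed here in meters after decoding."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


# Illustrative heights only (not data from the paper): mostly low-rise
# buildings plus one tall outlier, mimicking a skewed distribution.
train_heights = [3.0, 6.0, 9.0, 12.0, 90.0]
mean, std = fit_log_stats(train_heights)

encoded = [encode_height(h, mean, std) for h in train_heights]
decoded = [decode_height(z, mean, std) for z in encoded]
round_trip_mae = mean_absolute_error(decoded, train_heights)
```

The log transform pulls the tall outlier much closer to the other targets in encoded space, so a regression loss is less dominated by rare high-rise buildings. The round trip through `decode_height` returns the original values up to floating-point error.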
