Improved Vision-to-Chart Buoy Association with Learned World-to-Image Projection
Quick Take
This report modifies the DETR-based fusion transformer baseline for the MaCVi 2026 Vision-to-Chart data association challenge by adding a dedicated MLP, QueryMLP, that predicts each buoy's waterline contact point in the image. The approach achieves an Overall score of 0.7386, an F1 score of 0.8055, and an mIoU of 0.6718 on the held-out test set, placing second on the challenge leaderboard.
Key Points
- Introduced QueryMLP to predict buoy waterline contact points from chart and IMU data.
- Reduced geometric reasoning burden on the transformer decoder with explicit spatial priors.
- Achieved an Overall score of 0.7386 in the MaCVi 2026 challenge leaderboard.
- F1 score reached 0.8055 and mIoU was 0.6718 on the held-out test set.
- Secured second place among all submissions in the competition.
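To make the "geometric reasoning burden" in the points above concrete, here is a minimal sketch of the kind of world-to-image mapping involved. It is not the paper's method: it assumes an ideal, level pinhole camera over a flat sea, ignores roll and pitch, and uses made-up values for camera height, focal length, and principal point. The function name `project_waterline_point` and all of its parameters are hypothetical.

```python
import math


def project_waterline_point(distance_m, bearing_deg, heading_deg,
                            cam_height_m=2.0, focal_px=1000.0,
                            cx=960.0, cy=540.0):
    """Project a buoy's waterline point to pixel coordinates.

    Idealized assumptions (not from the source): level pinhole camera,
    flat sea surface, no lens distortion, no roll or pitch.
    Returns (u, v), or None if the buoy is behind the camera.
    """
    # Bearing of the buoy relative to the optical axis, wrapped to [-180, 180).
    rel_deg = (bearing_deg - heading_deg + 180.0) % 360.0 - 180.0
    rel = math.radians(rel_deg)

    # Camera frame: x to the right, y down, z forward along the optical axis.
    x_right = distance_m * math.sin(rel)
    z_forward = distance_m * math.cos(rel)
    if z_forward <= 0.0:
        return None  # not in front of the camera

    # The waterline lies cam_height_m below the camera center.
    y_down = cam_height_m

    u = cx + focal_px * x_right / z_forward
    v = cy + focal_px * y_down / z_forward
    return u, v


if __name__ == "__main__":
    print(project_waterline_point(100.0, 10.0, 0.0))
    print(project_waterline_point(1000.0, 10.0, 0.0))
```

Even in this idealized form the mapping is nonlinear in distance and bearing, and a real vessel adds roll, pitch, and calibration effects on top, which is the sort of relationship the baseline decoder would otherwise have to learn implicitly.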
Article Excerpt
From source RSS / original summary (arXiv:2605.22942v1)
Abstract: This report presents a lightweight modification to the DETR-based fusion transformer baseline for the MaCVi 2026 Vision-to-Chart data association challenge. The challenge baseline decoder receives per-buoy queries encoding world-space distance and bearing, forcing the transformer to implicitly learn the complex geometric projection from world coordinates to image pixels.
Instead, this work trains an additional dedicated MLP, QueryMLP, to explicitly predict the buoy's waterline contact point in the image from chart measurements and IMU orientation data. The predicted pixel coordinates are appended to the baseline decoder query vector, providing a direct spatial prior per buoy and reducing the geometric reasoning burden on the transformer decoder. On the challenge leaderboard, the presented approach achieves an Overall score of 0.7386, with F1 = 0.8055 and mIoU = 0.6718, on the held-out test set, placing second among all submissions.
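The query augmentation described above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the input features, their scaling, the two-layer shape, the hidden width, the sigmoid output, and the helper names `build_query_features` and `augment_query` are all assumptions, and the weights here are random rather than trained.

```python
import math
import random


class QueryMLP:
    """Hypothetical two-layer MLP mapping chart + IMU features to (u, v).

    Layer sizes, activation, and output range are assumptions;
    the weights are random and untrained.
    """

    def __init__(self, in_dim, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
                   for _ in range(2)]
        self.b2 = [0.0, 0.0]

    def __call__(self, features):
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Output layer: two values squashed to (0, 1), read here as
        # normalized image coordinates of the waterline contact point.
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def build_query_features(distance_m, bearing_rad, roll_rad, pitch_rad):
    """Assumed feature layout: scaled distance, bearing as sin/cos, IMU angles."""
    return [distance_m / 1000.0, math.sin(bearing_rad), math.cos(bearing_rad),
            roll_rad, pitch_rad]


def augment_query(base_query, predicted_uv):
    """Append the predicted pixel coordinates to the baseline query vector."""
    return list(base_query) + list(predicted_uv)


if __name__ == "__main__":
    mlp = QueryMLP(in_dim=5)
    feats = build_query_features(250.0, 0.3, 0.01, -0.02)
    uv = mlp(feats)
    base_query = [0.25, 0.3]  # stand-in for the baseline distance/bearing query
    print(augment_query(base_query, uv))
```

The decoder then receives a query that already carries an image-space location estimate for each buoy, rather than world-space distance and bearing alone.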
