Multi-Stage VLM Pipeline for Zero-Shot Traffic Accident Understanding

arXiv cs.CV·Fumiya Tatematsu, Fumihiko Takahashi

5/29/2026

Quick Take

The first-place solution to the ACCIDENT challenge at the CVPR 2026 AUTOPILOT Workshop runs a multi-stage VLM pipeline on a frozen Qwen3-VL-32B-Instruct checkpoint and again on a 235B Mixture-of-Experts sibling, then blends the two outputs. It reaches a Public LB score of 0.55469 and a Private LB score of 0.57080, roughly +0.21 over the strongest host baseline (Molmo-7B, 0.358). The system predicts accident timing, impact centroid, and collision type from CCTV footage in a zero-shot setting.
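As a quick check on the reported margin, the arithmetic can be reproduced in a few lines of Python. The three scores are taken from the abstract. The variable names are illustrative, and the abstract does not say which leaderboard the baseline score of 0.358 was measured on, so comparing it against both leaderboards is an assumption.

```python
# Scores as quoted in the abstract.
PUBLIC_LB = 0.55469
PRIVATE_LB = 0.57080
BASELINE_MOLMO_7B = 0.358  # strongest host baseline; leaderboard not stated


def margin(score: float, baseline: float) -> float:
    """Absolute improvement of `score` over `baseline`, rounded to 3 decimals."""
    return round(score - baseline, 3)


public_margin = margin(PUBLIC_LB, BASELINE_MOLMO_7B)
private_margin = margin(PRIVATE_LB, BASELINE_MOLMO_7B)

print(f"public margin:  {public_margin:+.3f}")
print(f"private margin: {private_margin:+.3f}")
```

Under that assumption, the "roughly +0.21" figure matches the Private LB margin more closely than the Public LB one.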

Key Points

Runs a three-stage pipeline: full-video joint prediction, time refinement, and single-frame grounding of the impact centroid.
Blends the outputs of the two models at a 9:1 ratio, then snaps each predicted point onto the nearest vehicle detection.
Final system scores roughly +0.21 above the strongest host baseline (Molmo-7B, 0.358).
Code for the solution is publicly available on GitHub.
Predictions are zero-shot: the pipeline is built on a frozen Qwen3-VL-32B-Instruct checkpoint.

Article Excerpt

From source RSS / original summary

arXiv:2605.29325v1. Abstract: We present the 1st-place solution to the ACCIDENT challenge at the CVPR 2026 AUTOPILOT Workshop, which asks for zero-shot prediction of accident timing, impact centroid, and collision type from CCTV footage.

On a frozen Qwen3-VL-32B-Instruct checkpoint we build a three-stage pipeline (full-video joint prediction, time refinement, and single-frame grounding of the impact centroid), run the same pipeline a second time on a 235B Mixture-of-Experts sibling, blend the two outputs 9:1, and finally snap each predicted point onto the nearest vehicle detection. The final system reaches Public LB 0.55469 / Private LB 0.57080, roughly +0.21 over the strongest host baseline (Molmo-7B, 0.358) and wins the challenge.
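The last two steps of the abstract, the 9:1 blend and the snap onto the nearest vehicle detection, might look roughly like the Python sketch below. The abstract gives no implementation details, so everything here is an assumption for illustration rather than the authors' code: the `Prediction` type, the function names, the choice to give the 32B model the weight of 0.9, the use of a weighted average for time and coordinates, taking the collision type from the higher-weighted model, and snapping to the center of the nearest `(x1, y1, x2, y2)` box.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Prediction:
    """Hypothetical container for one video's prediction."""
    time_s: float         # predicted accident time in seconds
    x: float              # impact centroid, pixel coordinates
    y: float
    collision_type: str   # categorical label; cannot be averaged


def blend(primary: Prediction, secondary: Prediction,
          w_primary: float = 0.9) -> Prediction:
    """Weighted average of the numeric fields (assumed meaning of '9:1')."""
    w_secondary = 1.0 - w_primary
    return Prediction(
        time_s=w_primary * primary.time_s + w_secondary * secondary.time_s,
        x=w_primary * primary.x + w_secondary * secondary.x,
        y=w_primary * primary.y + w_secondary * secondary.y,
        # Assumption: keep the label of the higher-weighted model.
        collision_type=primary.collision_type,
    )


def snap_to_nearest_box(x: float, y: float,
                        boxes: list[tuple[float, float, float, float]]
                        ) -> tuple[float, float]:
    """Move (x, y) to the center of the closest detection box.

    Boxes are (x1, y1, x2, y2). With no detections the point is unchanged.
    """
    if not boxes:
        return (x, y)
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
    return min(centers, key=lambda c: hypot(c[0] - x, c[1] - y))
```

For example, blending a prediction at `(100, 200)` with one at `(200, 100)` gives a point near `(110, 190)`, which then snaps to whichever detected vehicle center lies closest. Whether the authors snap to box centers, box edges, or something else is not stated in the abstract.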

We ablate each component, report the negative results that shaped the final design, and release the code at https://github.com/fuumin621/cvpr2026-accident-1st-place-solution.
