Multi-Stage VLM Pipeline for Zero-Shot Traffic Accident Understanding
Quick Take
The first-place solution to the ACCIDENT challenge at the CVPR 2026 AUTOPILOT Workshop runs a three-stage VLM pipeline on Qwen3-VL-32B-Instruct, repeats it on a 235B Mixture-of-Experts model, and blends the two outputs. It reaches a Public LB score of 0.55469 and a Private LB score of 0.57080, roughly 0.21 above the strongest host baseline (Molmo-7B, 0.358). The system predicts accident timing, impact centroid, and collision type from CCTV footage in a zero-shot setting.
Key Points
- Runs a three-stage pipeline on a frozen checkpoint: full-video joint prediction, time refinement, and single-frame grounding of the impact centroid.
- Blends the outputs of the 32B model and the 235B Mixture-of-Experts model 9:1, then snaps each predicted point onto the nearest vehicle detection.
- Beats the strongest host baseline (Molmo-7B, 0.358) by roughly 0.21.
- Code for the solution is publicly available on GitHub.
- All predictions are zero-shot, and the authors ablate each component and report the negative results that shaped the final design.
Article Excerpt
arXiv:2605.29325v1. Abstract: We present the 1st-place solution to the ACCIDENT challenge at the CVPR 2026 AUTOPILOT Workshop, which asks for zero-shot prediction of accident timing, impact centroid, and collision type from CCTV footage.
On a frozen Qwen3-VL-32B-Instruct checkpoint we build a three-stage pipeline (full-video joint prediction, time refinement, and single-frame grounding of the impact centroid), run the same pipeline a second time on a 235B Mixture-of-Experts sibling, blend the two outputs 9:1, and finally snap each predicted point onto the nearest vehicle detection. The final system reaches Public LB 0.55469 / Private LB 0.57080, roughly +0.21 over the strongest host baseline (Molmo-7B, 0.358) and wins the challenge.
We ablate each component, report the negative results that shaped the final design, and release the code at https://github.com/fuumin621/cvpr2026-accident-1st-place-solution.
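The abstract does not say how the 9:1 blend or the snapping step is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the blend is a weighted average of the numeric outputs (time and centroid) with weight 0.9 on the first model, that detections arrive as `(x1, y1, x2, y2)` boxes, and that "nearest" means the box whose center is closest to the predicted point. The names `Prediction`, `blend`, and `snap_to_nearest_box` are hypothetical, and the categorical collision type is left out because averaging does not apply to it.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Prediction:
    """Hypothetical container for the numeric outputs of one model."""
    time_s: float  # predicted accident time, in seconds
    x: float       # impact centroid x (normalized image coordinates)
    y: float       # impact centroid y (normalized image coordinates)


def blend(primary: Prediction, secondary: Prediction, weight: float = 0.9) -> Prediction:
    """Weighted average of two predictions.

    weight=0.9 gives a 9:1 blend. Which model receives the larger
    weight is an assumption; the abstract does not state it.
    """
    w = weight
    return Prediction(
        time_s=w * primary.time_s + (1 - w) * secondary.time_s,
        x=w * primary.x + (1 - w) * secondary.x,
        y=w * primary.y + (1 - w) * secondary.y,
    )


def snap_to_nearest_box(x, y, boxes):
    """Move (x, y) to the center of the closest detection box.

    boxes is a list of (x1, y1, x2, y2) tuples. Using box centers as
    the distance target is an assumption. With no detections the
    point is returned unchanged.
    """
    if not boxes:
        return (x, y)
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
    return min(centers, key=lambda c: hypot(c[0] - x, c[1] - y))


if __name__ == "__main__":
    blended = blend(Prediction(10.0, 0.5, 0.5), Prediction(20.0, 0.3, 0.7))
    detections = [(0.0, 0.0, 0.2, 0.2), (0.4, 0.4, 0.6, 0.6)]
    print(blended, snap_to_nearest_box(blended.x, blended.y, detections))
```

Snapping to the box center is the simplest choice; projecting the point onto the nearest box edge, or leaving it untouched when it already lies inside a box, would be equally consistent with the one-line description in the abstract.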