CuriosAI Submission to the CASTLE Challenge at EgoVis 2026

arXiv cs.CV·Yuto Kanda, Hayato Tanoue, Takayuki Hori

5/28/2026

Quick Take

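CuriosAI submitted two approaches to the CASTLE Challenge at EgoVis 2026: SVA, which reached a leaderboard accuracy of 0.50 and is the team's final submission, and TMKG, which reached 0.35. SVA is a three-stage pipeline that narrows to a primary time window, verifies sub-windows with a VLM, and fuses the evidence with an LLM judge. TMKG builds a temporal multimodal knowledge graph, locates a primary cell via graph search, and answers with a single grounded VLM.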

Key Points

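SVA (Search-Verify-Answer) is a three-stage pipeline: hierarchical search for a primary window, VLM verification of sub-windows, and an LLM judge that fuses the evidence.
TMKG (Temporal-Multimodal-Knowledge-Graph) builds a temporal multimodal knowledge graph, finds a primary cell by graph search, and answers with a single grounded VLM.
SVA reached a leaderboard accuracy of 0.50 and is the final submission; TMKG reached 0.35.
The challenge poses 185 multiple-choice questions over 600+ hours of synchronized multi-view egocentric video.
Both approaches share a multimodal preprocessing layer: per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles.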

Article Excerpt

From source RSS / original summary

arXiv:2605.27800v1, Announce Type: new

Abstract: CASTLE 2026 asks 185 multiple-choice questions over 600+ hours of synchronized multi-view egocentric video. We explore two approaches on top of a shared multimodal preprocessing layer, including per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles.

Approach A, SVA: Search-Verify-Answer, is a three-stage pipeline that hierarchically narrows to a primary window, verifies sub-windows with a VLM under four anti-confabulation rules, and fuses evidence with an LLM judge under an evidence-priority hierarchy. Approach B, TMKG: Temporal-Multimodal-Knowledge-Graph, is the contrast: it builds a temporal multimodal knowledge graph, locates a primary cell via graph search, and produces the final answer with a single grounded VLM. SVA reaches a leaderboard accuracy of 0.50 and is our final challenge submission; TMKG reaches 0.35.
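The control flow of the three SVA stages can be sketched roughly as below. This is a toy illustration under loud assumptions, not the authors' implementation: word overlap stands in for the real search, a substring check stands in for the VLM verifier and its four anti-confabulation rules (which the abstract does not list), and the `PRIORITY` order, the `Evidence` type, and all function names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical evidence-priority order; the abstract does not list the real one.
PRIORITY = {"visual": 0, "transcript": 1, "caption": 2}


@dataclass
class Evidence:
    option: str  # answer option this piece of evidence supports
    source: str  # key into PRIORITY


def score(question, text):
    """Toy relevance: count of shared lowercase words (stand-in for real retrieval)."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def search(question, segments, window=4, levels=2):
    """Stage 1: coarse-to-fine narrowing to a primary window [lo, hi).

    Assumes `segments` is non-empty.
    """
    lo, hi, size = 0, len(segments), window
    for _ in range(levels):
        best = max(
            range(lo, hi, size),
            key=lambda s: sum(score(question, t) for t in segments[s:min(s + size, hi)]),
        )
        lo, hi = best, min(best + size, hi)
        size = max(1, size // 2)
    return lo, hi


def verify(options, segments, sources, lo, hi):
    """Stage 2: keep evidence only where a segment explicitly names an option.

    Stand-in for the VLM check; anything unsupported is simply not recorded.
    """
    found = []
    for i in range(lo, hi):
        for opt in options:
            if opt.lower() in segments[i].lower():
                found.append(Evidence(opt, sources[i]))
    return found


def judge(evidence):
    """Stage 3: return the option backed by the highest-priority source, or None."""
    if not evidence:
        return None
    return min(evidence, key=lambda e: PRIORITY[e.source]).option


def answer(question, options, segments, sources):
    lo, hi = search(question, segments)
    return judge(verify(options, segments, sources, lo, hi))


# Made-up data for illustration only; not taken from the CASTLE dataset.
segments = [
    "alice walks into the kitchen", "bob reads a book",
    "nothing happens", "the dog sleeps",
    "alice cooks dinner and picks pasta", "alice says she prefers rice",
    "bob washes dishes", "lights off",
]
sources = ["caption"] * 4 + ["visual", "transcript", "caption", "caption"]
question = "what does alice cook for dinner"
options = ["pasta", "rice", "soup"]

result = answer(question, options, segments, sources)
print(result)
```

In this sketch the judge can return `None` when no sub-window supports any option, which is one simple way to avoid answering from unsupported guesses. How the actual system handles that case is not stated in the abstract.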
