ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

arXiv cs.CV·Kanchan Keisham, Thenukan Pathmanathan, Thangarajah Akilan


6/1/2026

Quick Take

ConTrans is a multi-scale encoder for Zero-shot Temporal Action Localization (ZS-TAL) that combines convolutional inductive biases with transformer self-attention to capture both fine-grained local dependencies and long-range global context. The authors report that it significantly outperforms existing methods on the ActivityNet-1.3 and THUMOS14 datasets, establishing a new benchmark for ZS-TAL.

Key Points

ConTrans targets the relative-offset-based local correlations between video frames that existing ZS-TAL approaches often neglect.
The encoder integrates convolutional inductive biases with transformer self-attention.
Experiments on ActivityNet-1.3 and THUMOS14 show that it significantly outperforms existing methods.
The authors present these results as a new benchmark for ZS-TAL.

Article Excerpt

From source RSS / original summary

arXiv:2605.30689v1 (announce type: new). Abstract: Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures.

In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer Self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.
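To make the local-global idea concrete, here is a minimal pure-Python sketch of one way a temporal convolution and self-attention could be combined over a sequence of frame features and stacked into a multi-scale pyramid. It is not the ConTrans architecture: the abstract does not give the layer design, so the fixed smoothing kernel, the identity query/key/value projections, the residual-sum fusion, the stride-2 downsampling, and all function names (`conv1d_same`, `self_attention`, `local_global_block`, `multi_scale_encode`) are assumptions chosen only for illustration.

```python
import math


def conv1d_same(x, kernel):
    """Depthwise 1D convolution over time with zero padding (output length == input length)."""
    T, D = len(x), len(x[0])
    half = len(kernel) // 2
    out = []
    for t in range(T):
        row = [0.0] * D
        for k, w in enumerate(kernel):
            s = t + k - half  # neighbouring frame at a fixed relative offset
            if 0 <= s < T:
                for d in range(D):
                    row[d] += w * x[s][d]
        out.append(row)
    return out


def self_attention(x):
    """Single-head scaled dot-product self-attention (identity projections, an assumption)."""
    D = len(x[0])
    scale = 1.0 / math.sqrt(D)
    out = []
    for q in x:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(D)])
    return out


def local_global_block(x, kernel=(0.25, 0.5, 0.25)):
    """Fuse the local (conv) and global (attention) branches with a residual sum (assumed fusion)."""
    local = conv1d_same(x, kernel)
    glob = self_attention(x)
    return [
        [xi + li + gi for xi, li, gi in zip(xr, lr, gr)]
        for xr, lr, gr in zip(x, local, glob)
    ]


def multi_scale_encode(x, num_levels=3, stride=2):
    """Apply the block at progressively coarser temporal resolutions."""
    pyramid = []
    for _ in range(num_levels):
        x = local_global_block(x)
        pyramid.append(x)
        x = x[::stride]  # naive temporal downsampling (assumption)
    return pyramid


# Toy input: 16 frames, each with a 2-dimensional feature vector.
features = [[math.sin(0.3 * t), math.cos(0.3 * t)] for t in range(16)]
pyramid = multi_scale_encode(features)
print([len(level) for level in pyramid])  # prints [16, 8, 4]
```

The convolution branch only mixes frames at small fixed offsets, while the attention branch lets every frame attend to every other frame. Running both at several temporal resolutions is one plausible reading of "local-global multi-scale" representation. A practical implementation would use learned weights and a tensor library rather than nested lists.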

