Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion

arXiv cs.CV·Dmytro Klepachevskyi, Alexander Wong, Sirisha Rambhatla, Yuhao Chen

2h ago

·~1 min·5/27/2026·en·0

Quick Take

The study introduces a multi-stage SAM3 feature fusion method for zero-shot object re-identification in egocentric kitchen videos.

Key Points

Challenges include viewpoint changes and occlusions.
Enhanced SAM3 pipeline improves mAP from 45.3% to 52.8%.
Combines SAM3, DINOv2, and CLIP for better feature extraction.

Article Content

From source RSS / original summary

arXiv:2605. 26383v1 Announce Type: new Abstract: Object re-identification (ReID) in egocentric kitchen videos is challenging due to rapid viewpoint changes, frequent occlusions, cluttered scenes, and large intra-class appearance variations. Objects may leave and re-enter the field of view, and the large diversity of instances with limited annotations makes supervised ReID difficult to scale, motivating zero-shot approaches.

We study zero-shot object ReID on the EPIC-Kitchens benchmark, where the goal is to match active food and kitchen-tool instances across frames using only pre-trained visual features. We first evaluate five state-of-the-art feature extractors, including Vision-Language Models (VLMs) - CLIP, DINOv2, DreamSim, I-JEPA, and SAM3 - and show that zero-shot methods fail, with the best baseline achieving only 45. 3% mAP.

We then propose an Enhanced SAM3 ReID Pipeline, a zero-shot multi-stage method built around SAM3 segmentation as the core component. Stage 1 uses SAM3 to suppress background clutter. Stage 2 fuses embeddings from SAM3, DINOv2, and CLIP into a single L2-normalized descriptor. Stage 3 augments cosine similarity with mask-shape IoU for geometric consistency, and Stage 4 applies k-reciprocal re-ranking. The full pipeline improves performance by 7. 5% mAP to 52. 8%.

Reader Mode unavailable (could not extract clean content).

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion

Quick Take

Key Points

Article Content

Want this in your inbox every morning?

More from arXiv cs.CV

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

Deep Learning-Based Automated Quantification of TIMI Myocardial Perfusion Frame Count (DL-TMPFC) from Coronary Angiography: A Novel Framework for Rapid Assessment of Microvascular Dysfunction

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning