Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

arXiv cs.CV·Sathira Silva, Abrham Kahsay Gebreselasie, Muhammad Umer Sheikh, Kartik Kuckreja, Daniel Harari, Muhammad Haris Khan

6/12/2026

Quick Answer

BabyMind introduces an object-first inductive bias for grounding language in child-view video, improving Labeled-S 15 forced-choice accuracy on SAYCam-S by 2.6 points over CVCL.

Quick Take

BabyMind targets two ambiguities in infant-view recordings: when the named referent appears and where it sits in a cluttered frame. On SAYCam-S it improves Labeled-S 15 forced-choice accuracy by 2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available on GitHub.

Key Points

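BabyMind applies an object-first bias to child-view contrastive learning under sparse, noisy supervision.
It links candidate object embeddings across a short utterance-centered window into lightweight object files via tracking, then aligns each utterance to a bag of object files.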
The model stabilizes learning with track-coherence and global-object agreement regularizers.
BabyMind shows consistent gains on in-vocabulary out-of-distribution benchmarks.
Code is publicly available at https://github.com/sathiiii/BabyMind.

Article Content

From source RSS / original summary

arXiv:2606.12985v1. Abstract: Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors.

We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective.
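A minimal sketch of what a multiple-instance contrastive objective over bags of object files could look like is given here. It is a generic MIL variant of InfoNCE, not the authors' implementation: the prototype-space projection is omitted, and the log-sum-exp pooling, the temperature value, and every function name (`bag_score`, `mil_info_nce`) are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bag_score(utterance, bag, temperature):
    """Score an utterance against a bag of object-file embeddings.

    Assumption: soft-max pooling (log-sum-exp), so the best-matching
    object file dominates without having to be labeled in advance.
    """
    return logsumexp([cosine(utterance, obj) / temperature for obj in bag])


def mil_info_nce(utterances, bags, temperature=0.1):
    """Mean InfoNCE loss where utterance i is paired with bag i.

    Other bags in the batch act as negatives (an assumption here).
    """
    losses = []
    for i, utterance in enumerate(utterances):
        scores = [bag_score(utterance, bag, temperature) for bag in bags]
        # -log softmax of the matching bag among all bags in the batch.
        losses.append(logsumexp(scores) - scores[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    utterances = [[1.0, 0.0], [0.0, 1.0]]
    # Each bag holds several candidate object files; only one needs to match.
    bags = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0]]]
    print("aligned loss:", mil_info_nce(utterances, bags))
    print("swapped loss:", mil_info_nce(utterances, bags[::-1]))
```

The bag formulation is what lets a distractor object sit alongside the true referent: the loss only requires that some object file in the utterance's window matches, which is the weak-supervision setting the abstract describes.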

Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.
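For readers unfamiliar with the metric, forced-choice accuracy can be sketched as follows: each trial pairs a word embedding with several candidate frame embeddings, and the model is correct when the target frame has the highest similarity. The trial layout, the number of alternatives, and the function name `forced_choice_accuracy` are assumptions for illustration, not details taken from the paper or its code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def forced_choice_accuracy(trials):
    """Return accuracy in percent over a list of trials.

    Each trial is (word_embedding, candidate_frame_embeddings, target_index).
    This layout is hypothetical; a real benchmark defines its own format.
    """
    correct = 0
    for word, candidates, target in trials:
        sims = [cosine(word, frame) for frame in candidates]
        predicted = max(range(len(sims)), key=sims.__getitem__)
        correct += int(predicted == target)
    return 100.0 * correct / len(trials)


if __name__ == "__main__":
    trials = [
        # Target frame (index 0) is closest to the word embedding.
        ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.1, 0.9]], 0),
        # A foil (index 2) is slightly closer than the target (index 1).
        ([0.0, 1.0], [[1.0, 0.0], [0.2, 0.8], [0.1, 0.95], [-1.0, 0.0]], 1),
    ]
    print(forced_choice_accuracy(trials))
```

Under this reading, a gain reported in points is an absolute difference between two such percentages.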
