KidRisk: Benchmark Dataset for Children Dangerous Action Recognition

arXiv cs.CV·Minh-Kha Nguyen, Trung-Hieu Do, Kim Anh Phung, Thao Thi Phuong Dao, Minh-Triet Tran, Trung-Nghia Le

Quick Answer

This paper introduces KidRisk, a dataset of 2,500 short videos and 10,000 images for recognizing children's dangerous actions. On its benchmark, vision-language baselines reach 83.53% accuracy in action classification and 96.14% in dangerous-action recognition, outperforming traditional deep learning models.

Key Points

KidRisk dataset includes 2,500 videos and 10,000 images of children's actions.
Traditional deep learning models showed limited effectiveness on dangerous action recognition.
Vision-language models achieved 83.53% accuracy in classifying children's actions.
Dangerous action recognition accuracy reached 96.14% with the proposed vision-language baselines.
The dataset contributes to enhancing children's safety by identifying risky behaviors.
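
The two figures above are plain accuracy scores: the share of samples whose predicted label matches the reference label, computed separately for action classification and for dangerous-action recognition. The following is a minimal sketch of that calculation, not the authors' evaluation code. The function name, the label names, and the sample predictions are hypothetical and chosen only for illustration; the abstract does not list KidRisk's actual classes.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Return the fraction of samples whose prediction equals the reference."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for illustration only; not taken from KidRisk.
# Task 1: multi-class action classification.
action_true = ["climbing", "running", "jumping", "climbing"]
action_pred = ["climbing", "running", "climbing", "climbing"]

# Task 2: binary dangerous-action recognition.
danger_true = ["dangerous", "safe", "dangerous", "safe", "safe"]
danger_pred = ["dangerous", "safe", "dangerous", "safe", "dangerous"]

action_acc = accuracy(action_true, action_pred)  # 3 of 4 correct
danger_acc = accuracy(danger_true, danger_pred)  # 4 of 5 correct

print(f"action classification accuracy: {action_acc:.2%}")
print(f"danger recognition accuracy: {danger_acc:.2%}")
```

Because the two tasks use different label sets (many action classes versus a dangerous/safe decision), their accuracy values are not directly comparable with each other.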

[Submitted on 24 Jun 2026]

Abstract: Children are naturally energetic, and during their spontaneous activities they often encounter potentially dangerous situations, especially when parental supervision is lacking. Identifying actions that pose risks plays a crucial role in ensuring their safety. This paper builds a novel and challenging dataset, KidRisk, comprising 2,500 short videos of children's actions and 10,000 images of children's dangerous actions. We also introduce a benchmark on our newly constructed dataset and find that traditional deep learning models demonstrate limited effectiveness on these tasks. Therefore, we develop vision-language-based baselines with exceptional contextual understanding of visual information. Our proposed methods achieve an accuracy of 83.53% in classifying children's actions and 96.14% in recognizing children's dangerous actions, significantly outperforming traditional approaches. These results confirm that vision-language models are not only feasible but also highly effective in detecting hazardous actions, contributing positively to children's safety.

Comments:	SOICT 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2606.25298 [cs.CV]
	(or arXiv:2606.25298v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2606.25298 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Trung Nghia Le
[v1] Wed, 24 Jun 2026 01:59:29 UTC (904 KB)

— Originally published at arxiv.org