KidRisk: Benchmark Dataset for Children Dangerous Action Recognition
Quick Answer
This paper introduces the KidRisk dataset, comprising 2,500 short videos and 10,000 images, as a benchmark for recognizing children's dangerous actions. Vision-language baselines reach 83.53% accuracy in action classification and 96.14% in dangerous action recognition, outperforming traditional deep learning methods.
Key Points
- The KidRisk dataset includes 2,500 short videos of children's actions and 10,000 images of children's dangerous actions.
- Traditional deep learning models showed limited effectiveness on dangerous action recognition.
- Vision-language models achieved 83.53% accuracy in classifying children's actions.
- Dangerous action recognition accuracy reached 96.14% with the proposed vision-language baselines.
- The dataset contributes to enhancing children's safety by identifying risky behaviors.
Abstract: Children are naturally energetic, and during their spontaneous activities, they often encounter potentially dangerous situations, especially when lacking parental supervision. Identifying actions that pose risks plays a crucial role in ensuring their safety. This paper builds a novel, challenging dataset, namely KidRisk, including 2,500 short videos of children's actions and 10,000 images of children's dangerous actions. We also introduce a benchmark on our newly constructed dataset and find that traditional deep learning models demonstrate limited effectiveness on these tasks. Therefore, we develop vision-language based baselines with exceptional context understanding of visual information. Our proposed methods achieve an accuracy of 83.53% in classifying children's actions and 96.14% in recognizing children's dangerous actions, significantly outperforming traditional approaches. These results confirm that vision-language models are not only feasible but also highly effective in detecting hazardous actions, contributing positively to safeguarding children's safety.
| Comments: | SOICT 2024 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2606.25298 [cs.CV] |
| (or arXiv:2606.25298v1 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2606.25298 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Trung Nghia Le
[v1] Wed, 24 Jun 2026 01:59:29 UTC (904 KB)
Originally published at arxiv.org