Memory-Augmented LSTM Autoencoder for Unsupervised Activity Recognition with IMU Sensor Fusion
Quick Take
The proposed memory-augmented LSTM autoencoder framework reaches 96.6% accuracy on DaLiAc and 98.4% on PAMAP2 for unsupervised human activity recognition with IMU sensor fusion, surpassing both supervised baselines and unsupervised approaches. It captures spatiotemporal dependencies without labels, despite challenges such as noisy data and overlapping activities.
Key Points
- Introduces a fully unsupervised spatiotemporal feature fusion framework.
- Utilizes a memory-augmented autoencoder for enhanced activity representation.
- Improves feature separability by up to 9% despite using shorter temporal windows.
- Evaluated with realistic inter-class window segmentation, which lowers accuracy by about 7% but better reflects real-world activity transitions.
- Surpasses traditional supervised baselines in accuracy.
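The summary does not say how the paper measures feature separability. As a minimal sketch of one common way to quantify it, the snippet below computes a Fisher-style ratio of between-class to within-class scatter over learned feature vectors. The function name `separability` and the toy data are illustrative assumptions, not the authors' metric; labels are used here only to score the features, never to train them.

```python
def separability(features, labels):
    """Ratio of between-class to within-class scatter (higher = better separated).

    Assumed stand-in metric for illustration; the paper's own measure may differ.
    """
    dims = len(features[0])

    # Group feature vectors by their evaluation label.
    groups = {}
    for vec, lab in zip(features, labels):
        groups.setdefault(lab, []).append(vec)

    def mean(vectors):
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(features)
    between = 0.0
    within = 0.0
    for vectors in groups.values():
        centre = mean(vectors)
        # Spread of class centres around the global mean, weighted by class size.
        between += len(vectors) * sq_dist(centre, overall)
        # Spread of samples around their own class centre.
        within += sum(sq_dist(v, centre) for v in vectors)
    return between / within if within > 0 else float("inf")


# Toy 2-D features: two well-separated clusters versus two overlapping ones.
tight = [[0, 0], [0, 1], [10, 0], [10, 1]]
loose = [[0, 0], [0, 1], [1, 0], [1, 1]]
names = ["walk", "walk", "run", "run"]
tight_score = separability(tight, names)
loose_score = separability(loose, names)
```

A relative gain such as the reported "up to 9%" would then correspond to comparing a score like this between two feature extractors on the same windows.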
Paper Resources
Abstract: Human Activity Recognition (HAR) using Inertial Measurement Unit (IMU) sensors is vital for healthcare monitoring and rehabilitation. Despite deep learning advancements, major challenges remain: reliance on labeled data, multi-sensor fusion complexity, and the limited ability of unsupervised methods to capture spatiotemporal dependencies. These issues are pronounced in real-world scenarios with noisy data, overlapping activities, and missing labels. We propose a fully unsupervised spatiotemporal feature fusion framework using a memory-augmented autoencoder. It enhances activity representations via short temporal windows of multi-sensor IMU data, enabling real-time applications. Our framework extracts hierarchical static features via a Stacked Autoencoder, fusing them within and across sensors. A sequence-to-sequence LSTM Autoencoder then temporally refines these features, incorporating historical motion patterns without labels. We analyze key hyperparameters to identify configurations that maximize feature separability under short-window constraints. Evaluated on DaLiAc and PAMAP2 using realistic inter-class window segmentation, our method achieves 96.6% and 98.4% accuracy, respectively, surpassing supervised baselines and unsupervised approaches. Our method improves feature separability by up to 9% despite shorter temporal windows. While our realistic inter-class segmentation reduces accuracy by ~7%, it was intentionally adopted to better reflect real-world activity transitions and practical relevance.
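The abstract does not spell out the inter-class window segmentation protocol. A minimal sketch, assuming it means sliding fixed-length windows over the continuous recording so that some windows straddle an activity transition, could look like the following. The function `sliding_windows`, its parameters, and the majority-label rule are hypothetical choices for illustration; the labels serve only for evaluation.

```python
from collections import Counter


def sliding_windows(samples, labels, window, step):
    """Cut a continuous recording into fixed-length windows.

    Windows are taken across activity boundaries (no per-class splitting),
    so some of them contain samples from two activities.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    out = []
    for start in range(0, len(samples) - window + 1, step):
        chunk = samples[start:start + window]
        chunk_labels = labels[start:start + window]
        # Majority label is kept for evaluation only, never for training.
        majority, _ = Counter(chunk_labels).most_common(1)[0]
        # Flag windows that span a transition between activities.
        mixed = len(set(chunk_labels)) > 1
        out.append((chunk, majority, mixed))
    return out


# Toy recording: 10 samples, with a transition from "walk" to "run" at index 5.
samples = list(range(10))
labels = ["walk"] * 5 + ["run"] * 5
windows = sliding_windows(samples, labels, window=4, step=2)
transition_windows = [w for w in windows if w[2]]
```

Under this assumption, the mixed windows are the harder cases, which is consistent with the abstract's note that realistic segmentation costs about 7% accuracy compared with segmenting each activity separately.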
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2606.28377 [cs.CV] (or arXiv:2606.28377v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.28377 (arXiv-issued DOI via DataCite)
Submission history
From: Saeed Arabzadeh
[v1] Fri, 19 Jun 2026 06:28:20 UTC (1,535 KB)
— Originally published at arxiv.org