DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion
Quick Take
DiffCrossGait tackles 2D-3D cross-modal gait recognition by aligning the diffusion trajectories of both modalities in a shared latent space, rather than aligning only their final embeddings. The authors report state-of-the-art results on the SUSTech1K and FreeGait benchmarks. Because the diffusion mechanism serves only as a training objective, decoupled from the discriminative backbone, inference requires no iterative denoising.
Key Points
- DiffCrossGait reformulates cross-modal matching as trajectory-level alignment.
- Drives both modalities with shared Gaussian noise for continuous alignment in latent space.
- Introduces a Tri-Phase Alignment Strategy that enforces identity anchoring, dynamics consistency, and cross-modal structural recoverability.
- Decouples generative alignment from the discriminative backbone, so inference needs no iterative denoising.
- Achieves state-of-the-art performance on SUSTech1K and FreeGait datasets.
Article Content
From source RSS / original summary: arXiv:2606.00153v1. Abstract: Cross-modal 2D-3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D LiDAR range-view representations. While prior methods align only final embeddings, we propose DiffCrossGait, which reformulates cross-modal matching as trajectory-level alignment in an identity-relevant latent diffusion space, rather than assuming full equivalence between 2D and 3D observations.
By driving both modalities with shared Gaussian noise within a latent space, we enable continuous alignment throughout the generative evolution. We introduce a Tri-Phase Alignment Strategy that exploits varying noise intensities to enforce identity anchoring, dynamics consistency, and cross-modal structural recoverability, thereby constraining both modalities to share denoising dynamics and bottleneck structure, which promotes modality-invariant gait features.
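The shared-noise idea can be illustrated with a small sketch. It is not the authors' implementation: the linear beta schedule, the latent shapes, and the `phase_of` helper that splits timesteps into three noise bands are all assumptions made for illustration, and the abstract does not say which noise band serves which of the three alignment goals. The sketch only shows the standard forward diffusion step applied to two latents with the same Gaussian noise, in which case the gap between the two noised trajectories at every step is a scaled copy of the gap between the clean latents.

```python
import numpy as np

T = 1000
# Assumed schedule: the linear beta schedule commonly used with DDPM.
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal level per timestep


def noise_latent(z0, eps, a_bar):
    """Standard forward diffusion step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps


def phase_of(t, num_steps):
    """Hypothetical three-way split of timesteps by noise intensity."""
    if t < num_steps // 3:
        return "low"
    if t < 2 * num_steps // 3:
        return "mid"
    return "high"


rng = np.random.default_rng(0)
z2d = rng.normal(size=(4, 16))    # stand-in latents for 2D silhouettes
z3d = rng.normal(size=(4, 16))    # stand-in latents for 3D LiDAR range views
eps = rng.normal(size=z2d.shape)  # one noise sample shared by both modalities

gaps = {}
for t in (100, 500, 900):
    zt2d = noise_latent(z2d, eps, alpha_bar[t])
    zt3d = noise_latent(z3d, eps, alpha_bar[t])
    # The shared noise cancels, leaving only the scaled clean-latent gap.
    gaps[t] = zt2d - zt3d
    print(t, phase_of(t, T), float(np.mean(gaps[t] ** 2)))
```

With independent noise per modality the difference would also contain a random term, so sharing the noise is what makes the two trajectories directly comparable at each step.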
Crucially, our framework decouples generative alignment from the discriminative backbone; the diffusion mechanism serves exclusively as a training objective, ensuring high inference efficiency by eliminating the computational overhead of iterative denoising. Extensive experiments on the SUSTech1K and FreeGait benchmarks demonstrate that DiffCrossGait achieves state-of-the-art performance.
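The decoupling described above can be sketched as follows. Everything here is a simplified assumption rather than the paper's architecture: `backbone` is a stand-in linear encoder shared by both inputs, `diffusion_alignment_loss` is a hypothetical placeholder for the training-only term, and `retrieve` is a plain cosine-similarity rank-1 matcher. The point is only that the diffusion term appears in the training path, while the inference path uses backbone embeddings alone.

```python
import numpy as np


def backbone(x, W):
    """Stand-in discriminative encoder: linear map plus L2 normalisation."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def diffusion_alignment_loss(z_a, z_b, a_bar, rng):
    """Training-only term (hypothetical form): squared gap between two
    latents after both are noised with the same Gaussian sample."""
    eps = rng.normal(size=z_a.shape)
    zt_a = np.sqrt(a_bar) * z_a + np.sqrt(1.0 - a_bar) * eps
    zt_b = np.sqrt(a_bar) * z_b + np.sqrt(1.0 - a_bar) * eps
    return float(np.mean((zt_a - zt_b) ** 2))


def retrieve(probe_emb, gallery_emb):
    """Inference path: rank-1 match by cosine similarity, no denoising."""
    return np.argmax(probe_emb @ gallery_emb.T, axis=1)


rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16))  # assumed shared weights, for brevity
gallery_x = rng.normal(size=(5, 32))
probe_x = gallery_x + 0.01 * rng.normal(size=gallery_x.shape)

gallery_emb = backbone(gallery_x, W)
probe_emb = backbone(probe_x, W)

# Training would add this term to a recognition loss; it is never
# evaluated when matching probes against the gallery.
train_term = diffusion_alignment_loss(probe_emb, gallery_emb, 0.5, rng)

matches = retrieve(probe_emb, gallery_emb)
print(train_term, matches)
```

Under these assumptions the cost of a query is one encoder pass and one similarity lookup, independent of the number of diffusion steps used during training.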