Phonological Perception of Sign Language Models
Quick Answer
This paper evaluates Sign Language Recognition (SLR) models trained on American Sign Language (ASL) and finds that pose-based models are more sensitive to handshape contrasts, while pixel-based models better capture location changes.
Quick Take
The study probes SLR models trained on ASL using minimal pairs and compares their representations with human behavioral data. Pose-based models are sensitive to handshape contrasts, while pixel-based models better capture location changes. Although the models show emergent phonological sensitivity, each architecture's inductive biases limit what it captures, and current training paradigms do not overcome those biases.
Key Points
- SLR models trained on ASL show emergent phonological sensitivity.
- Pose-based models excel in distinguishing handshape contrasts.
- Pixel-based models better capture changes in location.
- Latent representations from pose-based models correlate with human perceptual judgments (r~0.49).
- Current training paradigms are insufficient to overcome architectural biases.
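The representational-alignment result in the key points can be illustrated with a small sketch. It assumes alignment is measured by correlating pairwise model distances with human dissimilarity ratings using Pearson's r; the summary does not state which correlation or distance the authors used, so cosine distance, Pearson's r, the sign names, and all numbers below are illustrative assumptions rather than the paper's procedure or data.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def representational_alignment(embeddings, human_dissimilarity):
    """Correlate model distances with human ratings over the same sign pairs.

    embeddings: dict mapping a sign label to its latent vector.
    human_dissimilarity: dict mapping (sign_a, sign_b) to a human rating,
        where a larger value means the signs were judged less similar.
    """
    pairs = sorted(human_dissimilarity)
    model_distances = [
        cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs
    ]
    human_ratings = [human_dissimilarity[pair] for pair in pairs]
    return pearson_r(model_distances, human_ratings)


# Hypothetical toy data: labels, vectors, and ratings are made up.
toy_embeddings = {
    "SIGN_A": [1.0, 0.0],
    "SIGN_B": [0.9, 0.1],
    "SIGN_C": [0.0, 1.0],
}
toy_human_dissimilarity = {
    ("SIGN_A", "SIGN_B"): 1.0,
    ("SIGN_A", "SIGN_C"): 5.0,
    ("SIGN_B", "SIGN_C"): 4.0,
}

toy_r = representational_alignment(toy_embeddings, toy_human_dissimilarity)
print(f"alignment r = {toy_r:.2f}")
```

A rank-based correlation such as Spearman's would be a reasonable alternative if the human ratings are ordinal; the sketch uses Pearson's r only to keep the example dependency-free.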
Paper Resources
Abstract: Sign languages are compositional systems where meaning arises by combining sublexical phonological parameters, such as handshape, location, and movement. While deep learning models for Sign Language Recognition (SLR) have achieved increased performance on translation benchmarks, it remains unclear whether these models distinguish abstract phonological features or merely rely on low-level statistical correlations. This work evaluates the phonological perception of SLR models trained on American Sign Language (ASL) by probing phonological sensitivity using minimal pairs and evaluating representational alignment with human behavioral data. Our results reveal that SLR models exhibit emergent phonological sensitivity, but with clear architectural trade-offs: pose-based models are sensitive to handshape contrasts, while pixel-based models better capture location changes. Furthermore, pose-based models learn latent representations that correlate with human perceptual similarity judgments (r~0.49). These findings suggest that while SLR models exhibit emergent phonology, current training paradigms are insufficient to scale them beyond their architectural inductive biases.
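One way to picture minimal-pair probing is to ask whether a model's representations of two signs that differ in a single phonological parameter sit farther apart than repeated renditions of the same sign. The sketch below is a hypothetical scoring function under that assumption, not the paper's protocol: the function name, the Euclidean distance, and the toy vectors are all illustrative.

```python
import math
from itertools import combinations, product


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(values):
    """Arithmetic mean of a non-empty iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def minimal_pair_sensitivity(reps_a, reps_b):
    """Mean between-sign distance minus mean within-sign distance.

    reps_a, reps_b: lists of embeddings for several renditions of each
    member of a minimal pair (at least two renditions per sign).
    A positive score means the model separates the two signs by more
    than the variation among renditions of the same sign.
    """
    within = [
        euclidean(u, v)
        for reps in (reps_a, reps_b)
        for u, v in combinations(reps, 2)
    ]
    between = [euclidean(u, v) for u, v in product(reps_a, reps_b)]
    return mean(between) - mean(within)


# Hypothetical embeddings for a pair differing only in handshape.
renditions_sign_a = [[0.0, 0.0], [0.0, 1.0]]
renditions_sign_b = [[3.0, 0.0], [3.0, 1.0]]

score = minimal_pair_sensitivity(renditions_sign_a, renditions_sign_b)
print(f"sensitivity score = {score:.3f}")
```

Computing such a score separately for handshape, location, and movement pairs would allow the kind of per-parameter comparison between pose-based and pixel-based models that the abstract describes.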
| Comments: | Accepted to CogSci 2026 |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.28667 [cs.CL] (or arXiv:2606.28667v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2606.28667 (arXiv-issued DOI via DataCite) |
Submission history
From: Kayo Yin
[v1] Sat, 27 Jun 2026 01:02:35 UTC (7,695 KB)
— Originally published at arxiv.org