The Wiola Architecture for Efficient Small Language Models
Quick Answer
Wiola is a new Small Language Model (SLM) architecture built from five novel components that target efficiency and inter-layer coherence.
Quick Take
Wiola's key innovations include Spiral Rotary Positional Encoding and Adaptive Token Merging. The paper compares the architecture systematically against GPT-2, LLaMA-2, and Mistral. Wiola is released in four sizes and is compatible with the HuggingFace Transformers ecosystem.
Key Points
- Wiola features five novel components for enhanced model efficiency.
- Spiral Rotary Positional Encoding embeds token positions on a 3D helical manifold.
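- Adaptive Token Merging merges semantically redundant adjacent tokens in the middle layers, which the authors say reduces attention complexity without information loss.
- Wiola is released in four sizes: 120M, 360M, 700M, and 1.5B parameters.
- The authors report that all 22 architectural unit tests pass and that the model is compatible with the HuggingFace Transformers ecosystem.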
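To make the token-merging point concrete, here is a minimal sketch of merging adjacent, near-duplicate token vectors. The summary does not give the paper's merge criterion or merge operator, so the cosine-similarity threshold, the averaging step, and the names `cosine` and `merge_adjacent` are all assumptions made for illustration, not the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def merge_adjacent(tokens, threshold=0.95):
    """Greedy left-to-right merge of adjacent similar vectors.

    Assumption: neighbours whose cosine similarity is at least `threshold`
    join the same group, and each group is replaced by its mean vector.
    """
    if not tokens:
        return []
    groups = [[tokens[0]]]
    for vec in tokens[1:]:
        # Compare the new vector with the last member of the current group.
        if cosine(groups[-1][-1], vec) >= threshold:
            groups[-1].append(vec)
        else:
            groups.append([vec])
    merged = []
    for group in groups:
        dim = len(group[0])
        merged.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return merged

# Hypothetical hidden states: two pairs of nearly identical neighbours.
hidden = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.0, 0.98]]
shorter = merge_adjacent(hidden)
print(len(hidden), "->", len(shorter))  # prints: 4 -> 2
```

Full self-attention computes one score per token pair, so shrinking this toy sequence from 4 to 2 tokens cuts the pairwise scores from 16 to 4. Plain averaging, as used here, does discard some information; how the paper avoids that is not described in the summary.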
Article Content
From source RSS / original summary
arXiv:2607.01394v1
Abstract: We present Wiola, a fully original Small Language Model (SLM) architecture built from first principles, sharing no structural lineage with any existing model family including GPT, LLaMA, Mistral, or Falcon.
Wiola introduces five independently novel components: (i) Spiral Rotary Positional Encoding (SRPE), which embeds token positions on a three-dimensional helical manifold combining absolute, relative, and hierarchical positional signals; (ii) Gated Cross-Layer Attention (GCLA), providing each decoder layer with soft cross-attention access to compressed summaries of two preceding layers for inter-layer coherence; (iii) Adaptive Token Merging (ATM), which dynamically merges semantically redundant adjacent tokens in middle network layers to reduce attention complexity without information loss; (iv) Dual Stream Feed-Forward (DSFF), replacing the conventional MLP with two parallel streams fused by a learned per-dimension gate; and (v) WiolaRMSNorm, a modified normalisation introducing a per-dimension learned offset vector that prevents representation collapse.
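Two of these components are simple enough to sketch from the abstract's wording alone. The code below is a hedged illustration: it assumes the WiolaRMSNorm offset is added after the usual RMS scaling and gain, and that the DSFF gate is a per-dimension sigmoid blending the two stream outputs. The abstract does not give the exact formulas, and the names `rmsnorm_with_offset` and `gated_fuse` are hypothetical.

```python
import math

def rmsnorm_with_offset(x, gain, offset, eps=1e-6):
    """RMS-normalise x, then apply a per-dimension gain and a per-dimension offset.

    Assumption: y_i = gain_i * x_i / rms(x) + offset_i. The placement of the
    offset is a guess; the abstract only says a learned offset vector exists.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * (v / rms) + b for v, g, b in zip(x, gain, offset)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fuse(stream_a, stream_b, gate_logits):
    """Blend two stream outputs with a per-dimension gate in (0, 1).

    Assumption: out_i = g_i * a_i + (1 - g_i) * b_i with g_i = sigmoid(logit_i).
    """
    fused = []
    for a, b, z in zip(stream_a, stream_b, gate_logits):
        g = sigmoid(z)
        fused.append(g * a + (1.0 - g) * b)
    return fused

# Toy values standing in for learned parameters and activations.
normed = rmsnorm_with_offset([3.0, 4.0], gain=[1.0, 1.0], offset=[0.1, -0.1])
zero_in = rmsnorm_with_offset([0.0, 0.0], gain=[1.0, 1.0], offset=[0.1, -0.1])
fused = gated_fuse([1.0, 1.0], [0.0, 0.0], gate_logits=[0.0, 10.0])
print(normed, zero_in, fused)
```

In this sketch an all-zero input maps to the offset vector rather than to zero, and a gate logit of 0 weights both streams equally while a large positive logit passes the first stream almost unchanged.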
We provide complete mathematical derivations, architectural block diagrams, complexity analyses, and systematic comparisons against GPT-2, LLaMA-2, and Mistral. Wiola is released in four sizes (120M, 360M, 700M, and 1.5B parameters) and is fully compatible with the HuggingFace Transformers ecosystem, with all 22 architectural unit tests passing.