PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis
Quick Answer
This study evaluates parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) on Qwen2.5-3B for telecommunications customer support, revealing that models with the lowest validation loss (0.5024) ranked only 6th-7th in qualitative assessments.
Quick Take
The study concludes that validation loss alone is inadequate for selecting conversational AI fine-tuning configurations: the configuration with the worst loss (0.6807) ranked first according to both LLM judges. It also adds an energy consumption analysis so that performance can be weighed against energy use.
Key Points
- Introduced a combinatorial synthetic data generation method based on a glossary of 52 industry-specific terms.
- Generated approximately 30,000 training examples across 1,560 problem scenarios.
- Evaluated 16 LoRA configurations, revealing a discrepancy between quantitative and qualitative performance.
- Best validation loss of 0.5024 ranked only 6th-7th in LLM-as-a-judge qualitative assessments.
- Energy-performance trade-off analysis supports sustainable LLM deployment.
Article Content
From source RSS / original summary
arXiv:2606.05176v1 Announce Type: new
Abstract: While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In addition, data sovereignty, regulatory constraints, and the handling of sensitive customer and network information complicate the use of externally hosted foundation models in this domain.
We present a systematic study of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) applied to Qwen2.5-3B to build a domain-specific conversational assistant. We introduce a combinatorial synthetic data generation approach based on a glossary of 52 industry-specific terms, producing approximately 30,000 training examples across 1,560 distinct problem scenarios via a generative pipeline powered by Gemini 2.0 Flash.
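As a rough illustration of what a combinatorial construction of this kind could look like, the sketch below crosses a glossary with a set of issue types to enumerate scenarios. The abstract does not say how the 1,560 scenarios are derived; 52 terms × 30 issue types is only one factorization consistent with the reported numbers. The term names, the issue types, the prompt wording, and the `build_scenarios` helper are all hypothetical, and the call to the generator model is omitted.

```python
from itertools import product

# Hypothetical placeholders: the abstract reports a 52-term glossary but does
# not list the terms or describe the scenario templates.
GLOSSARY = [f"term_{i:02d}" for i in range(1, 53)]            # 52 placeholder terms
ISSUE_TYPES = [f"issue_type_{j:02d}" for j in range(1, 31)]   # 30 is an assumption


def build_scenarios(terms, issue_types):
    """Cross every glossary term with every issue type (Cartesian product)."""
    return [
        {
            "term": term,
            "issue": issue,
            # Illustrative prompt that would be sent to a generator model.
            "prompt": (
                f"Write a customer support dialogue about '{term}' "
                f"where the customer reports: {issue}."
            ),
        }
        for term, issue in product(terms, issue_types)
    ]


scenarios = build_scenarios(GLOSSARY, ISSUE_TYPES)
print(len(scenarios))  # 1560, i.e. 52 * 30

# About 30,000 examples over 1,560 scenarios is roughly 19 examples each.
examples_per_scenario = 30_000 / len(scenarios)
print(round(examples_per_scenario, 1))  # 19.2
```

Under these assumptions, each scenario would need about 19 generated dialogues to reach the reported dataset size.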
We evaluate 16 LoRA configurations by varying hyperparameters and target modules. Our evaluation extends beyond standard metrics by incorporating energy consumption analysis and qualitative assessment using an LLM-as-a-judge framework with GPT-5.2 and Claude 4.5 Sonnet. Results show a clear divergence between quantitative and qualitative performance: models achieving the lowest validation loss do not necessarily obtain the best human-aligned rankings. The best validation loss (0.5024) ranks only 6th-7th in qualitative evaluation, while the worst loss (0.6807) ranks first according to both judges.
This work contributes (1) a combinatorial method for synthetic dataset construction, (2) insights into the impact of target module selection for LoRA injection, (3) evidence that validation loss alone is insufficient for selecting fine-tuning configurations in conversational AI, and (4) an energy-performance trade-off analysis for sustainable LLM deployment.
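One way to quantify a divergence like this is to rank the configurations by validation loss and by judge score, then compute Spearman's rank correlation between the two rankings. The sketch below is a minimal, dependency-free version that assumes no tied values. The loss and score values are toy numbers chosen for illustration; they are not results from the paper, and the abstract does not say that the authors used this statistic.

```python
def ranks(values, reverse=False):
    """Return 1-based ranks (1 = best). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Toy values for five hypothetical configurations; NOT data from the paper.
val_loss = [0.50, 0.55, 0.60, 0.65, 0.70]    # lower is better
judge_score = [6.0, 7.0, 6.5, 8.0, 9.0]      # higher is better

loss_rank = ranks(val_loss)                    # [1, 2, 3, 4, 5]
judge_rank = ranks(judge_score, reverse=True)  # [5, 3, 4, 2, 1]

rho = spearman(loss_rank, judge_rank)
print(round(rho, 2))  # -0.9
```

A value near +1 would mean the two criteria agree on the ordering; a strongly negative value, as in this toy case, would mean that picking the lowest-loss configuration tends to pick one the judges rate poorly.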