Evaluating LLM Usage for Efficient and Explainable Numerical and Classified Implicit Sentiment Analysis of Product Desirability
Quick Answer
This paper introduces a scalable framework that uses LLMs, specifically GPT-4o-mini, for implicit sentiment analysis of product desirability. It reports classification accuracy up to 94% and Pearson correlations up to 0.97 against expert labels, with GPT-4o-mini costing 94% less than larger models.
Quick Take
The approach enhances interpretability and trust, making it suitable for practical product evaluations.
Key Points
- LLMs generated numerical sentiment scores that closely matched expert labels, with Pearson correlations up to 0.97.
- GPT-4o-mini performed comparably to larger models at 94% lower cost.
- The framework includes model confidence ratings and human-readable explanations for better interpretability.
- Zero-shot sentiment analysis was conducted without relying on explicit review scores.
- Results support practical applications in product satisfaction assessment and marketing strategies.
Article Content
From source RSS / original summary
arXiv:2606.23701v1
Abstract: Qualitative product feedback can reveal nuanced user experiences, but its implicit sentiment is difficult to measure. This paper presents a scalable and interpretable framework that uses large language models (LLMs) to quantify product desirability from such data.
Using two Product Desirability Toolkit (PDT) datasets from ZORQ and CARMA comprising 106 respondent term groupings with gold-standard human annotation, zero-shot continuous numerical sentiment scoring and categorical sentiment classification are evaluated without relying on explicit review scores. Across the datasets, LLMs generated numerical sentiment scores directly from qualitative responses and closely matched expert labels, achieving Pearson correlations up to 0.97 and classification accuracy up to 94%.
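The two agreement measures named above, Pearson correlation for the numerical scores and accuracy for the categorical labels, can be illustrated with a minimal Python sketch. The toy scores and labels below are hypothetical and are not data from the paper; the function names `pearson_r` and `accuracy`, the [-1, 1] score range, and the three-class label set are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def accuracy(predicted, gold):
    """Fraction of predicted labels that exactly match the gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two equal-length, non-empty label sequences")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical toy data: expert (gold) annotations vs. model output.
expert_scores = [0.9, -0.4, 0.1, 0.7, -0.8]
model_scores = [0.8, -0.5, 0.2, 0.6, -0.7]
expert_labels = ["positive", "negative", "neutral", "positive", "negative"]
model_labels = ["positive", "negative", "positive", "positive", "negative"]

if __name__ == "__main__":
    print("Pearson r:", round(pearson_r(model_scores, expert_scores), 3))
    print("Accuracy:", accuracy(model_labels, expert_labels))
```

In this toy example four of the five labels agree, so the accuracy is 0.8. A real evaluation would also report a significance test for the correlation, which this sketch omits.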
LLMs maintained robustness even when handling data presented in multiple forms and consistently expressed high confidence. In contrast, lexicon-based and transformer baselines did not produce statistically significant results. Among the models tested, GPT-4o-mini achieved performance comparable to larger models at 94% lower cost, supporting scalable deployment.
The framework also incorporates model confidence ratings and human-readable rationale explanations (xAI), improving interpretability, transparency, and trust while supporting practical use in product satisfaction assessment.
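One way to picture a result that carries a score, a class, a confidence rating, and a rationale together is a small validated record. The following Python sketch assumes the model is asked to reply with a JSON object; the field names, the `SentimentResult` and `parse_result` names, the label set, and the value ranges are all hypothetical and are not taken from the paper.

```python
import json
from dataclasses import dataclass

# Assumed label set and ranges; the paper's actual scales may differ.
ALLOWED_LABELS = {"negative", "neutral", "positive"}


@dataclass(frozen=True)
class SentimentResult:
    score: float       # numerical sentiment, assumed to lie in [-1, 1]
    label: str         # categorical sentiment class
    confidence: float  # model's self-reported confidence, assumed in [0, 1]
    rationale: str     # human-readable explanation of the judgment


def parse_result(raw: str) -> SentimentResult:
    """Parse and validate one JSON reply in the hypothetical schema above."""
    data = json.loads(raw)
    score = float(data["score"])
    confidence = float(data["confidence"])
    label = str(data["label"]).strip().lower()
    rationale = str(data["rationale"]).strip()
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    if not rationale:
        raise ValueError("rationale must not be empty")
    return SentimentResult(score, label, confidence, rationale)


if __name__ == "__main__":
    # Hypothetical reply text, written by hand for illustration.
    reply = (
        '{"score": 0.6, "label": "Positive", "confidence": 0.9, '
        '"rationale": "Most selected terms describe the product as reliable."}'
    )
    print(parse_result(reply))
```

Validating each reply before it enters the analysis keeps malformed or out-of-range outputs from silently skewing correlation and accuracy figures, and storing the rationale next to the score lets a reviewer audit individual judgments.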
In general, pairing the PDT as a survey method with a cost-efficient LLM for sentiment analysis has the potential to yield rich product evaluations. The results include both numerical and classified sentiment scores, along with high-level user impressions of the product that can inform product development and improvement as well as marketing aimed at target audiences.