A comparative study of transformer-based… · DeepSignal

A comparative study of transformer-based embeddings for topic coherence

arXiv cs.CL · Alex Ding, Tarun Rapaka, Willy Rodriguez, Jason Yang

5/29/2026

Quick Take

This study evaluates how model size affects topic coherence across seven transformer-based models in a BERTopic pipeline. Although parameter counts range from 22 million to 13 billion, smaller models such as MiniLM achieve topic quality comparable to larger models such as LLaMA-2.

Key Points

Seven transformer models were analyzed, including MiniLM and LLaMA-2.
Topic quality was assessed using coherence and divergence metrics.
Model size had negligible impact on topic quality.
Smaller models can achieve performance comparable to larger ones.
Study contributes to understanding transformer-based embeddings in NLP.

Article Content

From source RSS / original summary

arXiv:2605.28832v1. Abstract: Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations.

It is also known that model size (in terms of number of parameters) has a significant impact on the performance of language models on various pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora.
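As a rough illustration of what swapping embedding models in a BERTopic pipeline can look like, here is a minimal sketch. It is not the authors' code: the abstract names only MiniLM and LLaMA-2, so the checkpoint identifier `all-MiniLM-L6-v2`, the list `CANDIDATE_EMBEDDERS`, and the helpers `fit_topics` and `top_words` are assumptions made for this example. It assumes the `bertopic` and `sentence-transformers` packages are installed.

```python
from typing import Dict, List

# Hypothetical candidate list. Only "MiniLM" is named in the abstract; this
# specific checkpoint is an assumption, not the paper's configuration.
CANDIDATE_EMBEDDERS: List[str] = ["all-MiniLM-L6-v2"]


def fit_topics(docs: List[str], embedder_name: str):
    """Fit a BERTopic model on docs using the named sentence embedder."""
    # Imported lazily so the helpers below can be used without these packages.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer(embedder_name)
    topic_model = BERTopic(embedding_model=embedder)
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics


def top_words(topic_model, k: int = 10) -> Dict[int, List[str]]:
    """Return the k highest-ranked words per topic, skipping outlier topic -1."""
    result: Dict[int, List[str]] = {}
    for topic_id, words in topic_model.get_topics().items():
        if topic_id == -1:  # BERTopic assigns outlier documents to topic -1
            continue
        result[topic_id] = [word for word, _score in words[:k]]
    return result
```

To compare embedders under these assumptions, one would call `fit_topics(docs, name)` for each entry in `CANDIDATE_EMBEDDERS`, keeping the corpus and all other settings fixed, and then pass the output of `top_words` to whichever topic-quality metrics are being used.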

Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve performance comparable to larger models.
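To make the two kinds of metrics concrete, the sketch below computes one coherence score and one divergence score in plain Python. The abstract does not say which specific measures were used, so both choices here are assumptions: NPMI over document-level co-occurrence stands in for coherence (it is one of the confirmation measures in the Röder et al. framework), and Jensen-Shannon divergence between two topic-word distributions stands in for divergence. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def npmi_coherence(topic_words: Sequence[str], documents: List[List[str]]) -> float:
    """Mean NPMI over all word pairs, using document-level co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words: str) -> float:
        # Fraction of documents that contain every given word.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # words never co-occur: minimum NPMI
        elif p12 == 1.0:
            scores.append(1.0)   # limit value when both words are in every doc
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    total = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        m = 0.5 * (pw + qw)
        if pw > 0:
            total += 0.5 * pw * math.log2(pw / m)
        if qw > 0:
            total += 0.5 * qw * math.log2(qw / m)
    return total
```

With base-2 logarithms the divergence lies in [0, 1]: 0 for identical topics and 1 for topics that share no words. NPMI lies in [-1, 1], with higher values indicating that a topic's top words tend to appear in the same documents.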
