How Much Structure Do LLMs Need? Evaluating LLMs for Bibliometric Cluster Description
Quick Take
This study evaluates whether bibliometric structure improves LLM-assisted literature synthesis. LLMs generate cluster descriptions that are semantically close to human-written ones, but they are unreliable when asked to infer bibliometric structure without algorithmic support. The best results come from a hybrid approach in which algorithms define the clusters and LLMs interpret them.
Key Points
- LLM-generated cluster descriptions are semantically close to human-written ones, but LLMs are unreliable when asked to infer bibliometric structure from scratch.
- Performance improves when bibliometric algorithms define the clusters and the LLM only interprets them.
- The study utilized 100 published bibliometric analyses to assess LLM outputs.
- Outputs were assessed on human alignment, semantic coverage, clustering quality, graph quality, and reference grounding.
- Hybrid workflows combining algorithms and LLMs are recommended for effective bibliometric synthesis.
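To make the "human alignment" criterion concrete, here is a minimal sketch of how an LLM-generated description could be scored against a human-written one. It uses bag-of-words cosine similarity as a crude stand-in; the summary does not say which similarity measure the paper uses, and the function names `tokenize` and `cosine_similarity` as well as the sample sentences are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words count vectors of two texts.

    Assumption: word overlap is only a rough proxy for semantic closeness;
    an embedding model would normally replace the count vectors.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Hypothetical descriptions of the same cluster.
human = "This cluster covers citation analysis of machine learning research."
generated = "The cluster focuses on citation analysis in machine learning."
score = cosine_similarity(human, generated)
```

A score near 1 means the two descriptions share most of their vocabulary, and a score of 0 means they share none. It says nothing about whether the underlying cluster assignment is correct, which is why clustering and graph quality are assessed separately.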
Article Excerpt
arXiv:2605.24351v1. Abstract: Large language models (LLMs) can support scientific literature synthesis, but remain prone to hallucinated references, uneven coverage, and weakly grounded thematic organization. We evaluate whether bibliometric structure improves LLM-assisted synthesis by comparing six pipelines for generating cluster descriptions under different levels of evidence and structure.
Using 100 published bibliometric analyses, we reconstruct Scopus corpora, extract human-written cluster descriptions, and assess outputs by human alignment, semantic coverage, clustering quality, graph quality, and reference grounding. Results show that LLMs produce descriptions semantically close to human-written ones, but are unreliable when asked to infer bibliometric structure from scratch. Performance improves when bibliometric algorithms define the clusters and the LLM interprets them.
Overall, LLM-assisted bibliometric synthesis is most promising as a hybrid workflow in which algorithms provide auditable structure and LLMs generate readable descriptions.
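The hybrid workflow described in the abstract can be sketched as two steps: an algorithm fixes the clusters, then a prompt asks the LLM to describe each cluster without changing its membership. The sketch below is illustrative only. Connected components stand in for whatever clustering algorithm the paper uses, no LLM is actually called, and the names `connected_components`, `build_prompt`, the paper IDs, and the titles are all hypothetical.

```python
from collections import defaultdict


def connected_components(edges: list[tuple[str, str]]) -> list[list[str]]:
    """Group papers into clusters; each connected component is one cluster.

    Assumption: a simple stand-in for a real bibliometric clustering method.
    """
    graph: dict[str, set[str]] = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen: set[str] = set()
    clusters: list[list[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects every paper reachable from `start`.
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(members))
    return clusters


def build_prompt(cluster: list[str], titles: dict[str, str]) -> str:
    """Build a prompt that asks for a description of one fixed cluster."""
    listing = "\n".join(f"- [{pid}] {titles[pid]}" for pid in cluster)
    return (
        "Describe the common theme of the following papers in 2-3 sentences. "
        "Cite only the bracketed IDs listed here.\n" + listing
    )


# Hypothetical toy citation links and titles.
edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
titles = {
    "p1": "Co-citation mapping of AI research",
    "p2": "Bibliographic coupling in computer science",
    "p3": "Science mapping tools compared",
    "p4": "Hallucinated references in LLM output",
    "p5": "Grounding LLM summaries in sources",
}
clusters = connected_components(edges)
prompts = [build_prompt(cluster, titles) for cluster in clusters]
```

Because cluster membership is computed before any prompt is written, it can be audited independently of the model, and restricting citations to the listed IDs gives a simple check for reference grounding.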