SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
Quick Take
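SciCustom builds custom benchmarks from large-scale scientific data to evaluate the application-specific scientific capabilities of large language models, without expert annotation or synthetic question generation.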
Key Points
- Enables custom benchmarks from large-scale scientific data.
- Identifies relevant knowledge units via multi-model consensus.
- Reveals fine-grained LLM capabilities overlooked by standard benchmarks.
Authors: Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang
Abstract: Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at this https URL.
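The abstract mentions that relevant knowledge units are identified via voting-based multi-model consensus but does not give the voting rule. A minimal sketch of one plausible reading, simple threshold voting, is shown below. The function name `consensus_units`, the `min_votes` parameter, the model names, and the knowledge-unit labels are all hypothetical and are not taken from the paper or its code.

```python
from collections import Counter
from typing import Dict, List, Set


def consensus_units(votes: Dict[str, Set[str]], min_votes: int) -> List[str]:
    """Return the knowledge units selected by at least `min_votes` models.

    `votes` maps a model name to the set of knowledge units that model
    judged relevant to the custom requirement. This is an assumed voting
    rule for illustration, not the rule used by SciCustom.
    """
    # Each model contributes at most one vote per unit because it holds a set.
    counts = Counter(unit for units in votes.values() for unit in units)
    # Sort so the result is deterministic regardless of set ordering.
    return sorted(unit for unit, n in counts.items() if n >= min_votes)


# Hypothetical model names and knowledge-unit labels, for illustration only.
votes = {
    "model_a": {"organic_synthesis", "reaction_kinetics", "spectroscopy"},
    "model_b": {"organic_synthesis", "reaction_kinetics"},
    "model_c": {"organic_synthesis", "thermodynamics"},
}

selected = consensus_units(votes, min_votes=2)
print(selected)  # ['organic_synthesis', 'reaction_kinetics']
```

Raising `min_votes` trades recall for precision: with `min_votes=3` only units that every model agrees on survive, which in this toy input leaves `organic_synthesis` alone.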
| Comments: | Accepted to ACL 2026 Main Conference |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.19357 [cs.CL] (or arXiv:2605.19357v1 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.19357 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Yiyang Gu
[v1] Tue, 19 May 2026 04:41:43 UTC (2,290 KB)
— Originally published at arxiv.org