SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

arXiv cs.CL·Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang


Quick Take

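SciCustom builds custom benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in large language models, requiring neither expert annotation nor synthetic question generation.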

Key Points

Enables custom benchmarks from large-scale scientific data.
Identifies relevant knowledge units via multi-model consensus.
Reveals fine-grained LLM capabilities overlooked by standard benchmarks.


[Submitted on 19 May 2026]

Authors: Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang


Abstract: Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at this https URL.
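The abstract says relevant knowledge units are identified via voting-based multi-model consensus but does not give the voting rule. The Python sketch below shows one possible reading, assuming each model returns a set of unit IDs it judges relevant and a unit is kept when a strict majority of models select it. The function name `consensus_units`, the `min_votes` parameter, and the example unit names are hypothetical and do not come from the paper or its source code.

```python
from collections import Counter


def consensus_units(votes_per_model, min_votes=None):
    """Return the knowledge units chosen by at least `min_votes` models.

    votes_per_model: list of sets, one per model, holding the unit IDs
        that model judged relevant (hypothetical input format).
    min_votes: vote threshold; defaults to a strict majority of models
        (an assumption, not the paper's stated rule).
    """
    if not votes_per_model:
        return set()
    if min_votes is None:
        min_votes = len(votes_per_model) // 2 + 1
    # Count each unit at most once per model.
    counts = Counter(unit for votes in votes_per_model for unit in set(votes))
    return {unit for unit, n in counts.items() if n >= min_votes}


# Illustrative votes from three hypothetical models for a chemistry requirement.
votes = [
    {"organic_synthesis", "reaction_kinetics"},
    {"organic_synthesis", "spectroscopy"},
    {"organic_synthesis", "reaction_kinetics", "thermodynamics"},
]
selected = consensus_units(votes)
print(sorted(selected))
```

Raising `min_votes` to the number of models turns the rule into unanimity, which trades recall of relevant units for precision.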

Comments:	Accepted to ACL 2026 Main Conference
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2605.19357 [cs.CL]
	(or arXiv:2605.19357v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2605.19357 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yiyang Gu
[v1] Tue, 19 May 2026 04:41:43 UTC (2,290 KB)

Originally published at arxiv.org

