Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow
Quick Take
This article presents a protocol for evaluating ChatGPT's ability to generate and verify disease-centric biomedical associations. Biological entities are validated against biomedical ontologies, associations are verified against the literature, and a self-consistency strategy assesses generative reliability across ChatGPT models. Where ontology exact-match falls short, a RAG-enabled workflow built on open-source LLMs performs semantic verification.
Key Points
- The protocol evaluates ChatGPT's ability to generate biomedical associations.
- Biological entities are validated against biomedical ontologies; associations are verified against the literature.
- A self-consistency strategy assesses generative reliability across models.
- Semantic verification addresses limitations of ontology exact-match.
- RAG lets open-source LLMs establish truth over content generated by other LLMs and expose hallucination.
Article Excerpt
From the source RSS summary (arXiv:2605.30400v1). Abstract: We present a protocol to evaluate ChatGPT's ability to generate disease-centric biomedical associations. It outlines how we generate the associations, validate the biological entities using biomedical ontologies, and verify associations using literature. The protocol includes a self-consistency strategy to assess generative reliability across ChatGPT models.
To address ontology exact-match limitations, we provide a use case performing semantic verification through a workflow enabled by Retrieval-Augmented Generation (RAG) powered by open-source large language models (LLMs). This enables LLMs to establish truth over content generated by other LLMs and expose hallucination.