Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow

arXiv cs.CL·Ahmed Abdeen Hamed, Luis M. Rocha

~1 min · 6/1/2026

Quick Take

This article presents a protocol for evaluating how well ChatGPT generates and verifies disease-centric biomedical associations. Biological entities are validated against biomedical ontologies, and the associations themselves are verified against the literature. A self-consistency strategy assesses generative reliability across ChatGPT models, and a RAG-enabled workflow adds semantic verification to address the limitations of ontology exact-match.

Key Points

The protocol evaluates ChatGPT's ability to generate disease-centric biomedical associations.
Biological entities are validated against biomedical ontologies; associations are verified against the literature.
A self-consistency strategy assesses generative reliability across models.
Semantic verification addresses limitations of ontology exact-match.
A RAG workflow powered by open-source LLMs lets one LLM check content generated by another and expose hallucination.

Article Excerpt

From source RSS / original summary

arXiv:2605.30400v1. Abstract: We present a protocol to evaluate ChatGPT's ability to generate disease-centric biomedical associations. It outlines how we generate the associations, validate the biological entities using biomedical ontologies, and verify associations using literature. The protocol includes a self-consistency strategy to assess generative reliability across ChatGPT models.
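The cross-model majority voting named in the title could look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the function name `majority_vote`, the strict-majority default, the lowercase normalization, and the placeholder model and term names are all hypothetical.

```python
from collections import Counter


def majority_vote(associations_by_model, min_votes=None):
    """Return the terms proposed by at least `min_votes` models.

    `associations_by_model` maps a model name to the list of terms that
    model generated for one disease. By default a strict majority of the
    models is required (an assumption; the protocol may use another rule).
    """
    n_models = len(associations_by_model)
    if min_votes is None:
        min_votes = n_models // 2 + 1

    counts = Counter()
    for terms in associations_by_model.values():
        # Normalize and deduplicate so each model votes once per term.
        counts.update({t.strip().lower() for t in terms})

    return sorted(t for t, c in counts.items() if c >= min_votes)


# Placeholder names only; these are not real models, genes, or results.
votes = {
    "model_a": ["Gene_X", "gene_y"],
    "model_b": ["gene_x"],
    "model_c": ["gene_x ", "gene_z"],
}
consensus = majority_vote(votes)
```

Terms that only one model proposes are dropped, which is one simple way to treat agreement across models as a reliability signal.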

To address ontology exact-match limitations, we provide a use case performing semantic verification through a workflow enabled by Retrieval-Augmented Generation (RAG) powered by open-source large language models (LLMs). This enables LLMs to establish truth over content generated by other LLMs and expose hallucination.
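A toy sketch of why exact-match can fall short, and of the general retrieve-then-judge shape of a RAG check, is given below. Everything in it is an assumption for illustration: `exact_match`, `retrieve`, `verify`, and `stub_judge` are hypothetical names, the token-overlap ranking stands in for a real retriever, and the stub judge stands in for an LLM call.

```python
def exact_match(term, ontology_labels):
    """True only if the term equals an ontology label (case-insensitive)."""
    return term.strip().lower() in {label.lower() for label in ontology_labels}


def retrieve(query, passages, k=2):
    """Rank passages by token overlap with the query.

    A stand-in for a real retriever (for example, embedding search).
    """
    q = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(q & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def verify(association, passages, judge, k=2):
    """Retrieve evidence, then let `judge` decide if it supports the claim.

    In a RAG workflow `judge` would prompt an LLM with the evidence;
    here it is any callable taking (association, evidence) -> bool.
    """
    evidence = retrieve(association, passages, k)
    return judge(association, evidence), evidence


def stub_judge(association, evidence):
    """Placeholder judge: every claim token must appear in the evidence."""
    tokens = " ".join(evidence).lower().split()
    return all(tok in tokens for tok in association.lower().split())


# A synonym defeats exact matching even though the concept is the same.
synonym_hit = exact_match("Heart attack", ["Myocardial infarction"])

# Placeholder passages; not real literature.
passages = [
    "placeholder disease_a is linked to gene_x in cohort studies",
    "unrelated text about weather",
]
supported, evidence = verify("disease_a gene_x", passages, stub_judge, k=1)
```

Keeping the judge as a plain callable means the retrieval step can be tested on its own before any model is wired in.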

