BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts
Quick Answer
BioDivergence introduces a novel evaluation framework for contextual contradictions in biomedical abstracts, featuring a six-class conflict taxonomy and a silver benchmark of 11,865 claim pairs.
Quick Take
On the 842-example primary test set, Mistral-7B-Instruct-v0.3 reaches 0.5523 accuracy and 0.3894 contextual-F1. The fine-tuned reference model drops about 12 points when moved from the legacy deduplicated variant to the article-disjoint setting, which suggests that part of its earlier score came from article-level memorization.
Key Points
- BioDivergence features a 13-axis divergence ontology for nuanced evaluation.
- The framework distinguishes between contextual divergence and direct contradiction.
- Mistral-7B-Instruct-v0.3 achieved 0.3894 contextual-F1 on the primary test set.
- The silver benchmark includes claim pairs from five biomedical domains.
- The fine-tuned reference model drops about 12 points under the article-disjoint setting.
Article Content
From source RSS / original summary
arXiv:2606.11208v1
Abstract: Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence.
To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison.
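To make the four structured outputs concrete, here is a minimal Python sketch of what one annotated claim pair might look like. The abstract does not list the six taxonomy classes or the 13 axis names, so the field names, the label `contextual_divergence`, and the axis strings (borrowed from the abstract's examples of context variation) are all assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ClaimPairAnnotation:
    """One claim pair with the four structured outputs named in the abstract."""
    claim_a: str
    claim_b: str
    conflict_type: str                      # one of six classes (names assumed)
    divergence_axes: List[str] = field(default_factory=list)  # subset of 13 axes
    dominant_confounder: str = ""
    reconciliation_explanation: str = ""


def validate(record: ClaimPairAnnotation,
             allowed_types: Set[str],
             allowed_axes: Set[str]) -> List[str]:
    """Return a list of schema problems; an empty list means the record is valid."""
    problems = []
    if record.conflict_type not in allowed_types:
        problems.append(f"unknown conflict type: {record.conflict_type}")
    for axis in record.divergence_axes:
        if axis not in allowed_axes:
            problems.append(f"unknown axis: {axis}")
    return problems


# Hypothetical label sets; the real taxonomy and ontology are defined by the paper.
ALLOWED_TYPES = {"contextual_divergence", "direct_contradiction"}
ALLOWED_AXES = {"cohort", "geography", "assay_protocol"}

example = ClaimPairAnnotation(
    claim_a="Drug X lowers marker Y in adults.",
    claim_b="Drug X does not lower marker Y in children.",
    conflict_type="contextual_divergence",
    divergence_axes=["cohort"],
    dominant_confounder="cohort",
    reconciliation_explanation="The two studies enrolled different age groups.",
)
print(validate(example, ALLOWED_TYPES, ALLOWED_AXES))
```

Keeping the conflict type separate from the axes and the confounder is what lets a scorer credit a model that recognizes the divergence but names the wrong source of it, something a three-way entailment label cannot express.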
Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.
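The abstract does not describe how the article-disjoint split was built. As an illustration of the general idea only, the sketch below groups claim pairs by the articles they draw on and assigns whole groups to one side, so that no article contributes to both train and test. The tuple layout `(pair_id, article_a, article_b)`, the function name, and the greedy fill rule are assumptions, not the authors' procedure.

```python
from collections import defaultdict


def article_disjoint_split(pairs, test_fraction=0.2):
    """Split claim pairs so that train and test share no source article.

    pairs: list of (pair_id, article_a, article_b) tuples (assumed layout).
    Articles linked by any pair are merged into one group with union-find,
    and each group is assigned to a single split.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the two articles behind every pair into one group.
    for _, a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the pair ids that belong to each article group.
    groups = defaultdict(list)
    for pair_id, a, _ in pairs:
        groups[find(a)].append(pair_id)

    # Fill test with the smallest groups first until the target size is reached.
    target = test_fraction * len(pairs)
    train, test = [], []
    for root in sorted(groups, key=lambda r: (len(groups[r]), str(r))):
        bucket = test if len(test) < target else train
        bucket.extend(groups[root])
    return train, test


toy_pairs = [
    ("p1", "A", "B"),
    ("p2", "B", "C"),  # shares article B with p1, so both land in the same split
    ("p3", "D", "E"),
    ("p4", "F", "G"),
    ("p5", "H", "I"),
]
train_ids, test_ids = article_disjoint_split(toy_pairs, test_fraction=0.2)
```

A split that only deduplicates pairs can still place different claims from the same article on both sides, which is one plausible reason a fine-tuned model scores higher on the legacy variant than on the article-disjoint one.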