Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA
Quick Take
The study introduces DOSEBENCH, a benchmark of 81 curated scenarios for evaluating LLMs on over-the-counter (OTC) dosing QA. Across 1,620 responses from four LLMs, models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases, and stable or confident-looking responses can still violate dosing constraints. The authors present OTC dosing QA as a narrow but practical testbed for temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
Key Points
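- DOSEBENCH consists of 81 curated OTC dosing scenarios covering adult acetaminophen and ibuprofen use, with manually annotated gold references.
- Four LLMs were evaluated across repeated runs, yielding 1,620 responses scored on decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals.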
- Models struggle with rolling-window reasoning and ambiguity-sensitive cases.
- Confident responses can still violate dosing constraints, indicating safety risks.
- The benchmark serves as a practical testbed for evaluating medical QA capabilities.
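The point that a stable answer can still be a wrong one can be illustrated with a small sketch. The metric definitions here (majority-agreement consistency, exact-match accuracy) are assumptions for illustration, not the paper's definitions, and the run data is hypothetical. The per-pair run count is inferred from the reported totals under the assumption that every model saw every scenario the same number of times.

```python
from collections import Counter

# Totals reported in the abstract.
SCENARIOS, MODELS, RESPONSES = 81, 4, 1620

# Inferred, not stated in the source: assumes a uniform number of runs
# per (model, scenario) pair.
runs_per_pair = RESPONSES // (SCENARIOS * MODELS)


def consistency(decisions):
    """Share of runs that agree with the most common decision (assumed definition)."""
    counts = Counter(decisions)
    return counts.most_common(1)[0][1] / len(decisions)


def accuracy(decisions, gold):
    """Share of runs whose decision matches the gold reference (assumed definition)."""
    return sum(d == gold for d in decisions) / len(decisions)


# Hypothetical runs: the model says "yes" every time, but the gold answer is "no".
runs = ["yes"] * runs_per_pair
gold = "no"

print(runs_per_pair, consistency(runs), accuracy(runs, gold))
```

In this hypothetical case the model is perfectly consistent and never correct, which is why consistency alone is a weak safety signal.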
Article Content
From source RSS / original summary: arXiv:2606.04262v1, Announce Type: new.
Abstract: Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains underexplored in existing medical QA evaluations, where correct answers require tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories.
We introduce DOSEBENCH, a focused benchmark of 81 curated OTC dosing scenarios focused on adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs using metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, resulting in 1,620 model responses.
Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
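The reasoning the abstract describes (tracking dose timing, a rolling 24-hour total, and incomplete histories) can be sketched as a small rule check. This is a minimal illustration, not the benchmark's gold-reference logic: the `Dose` type, the `decide` function, the three-way answer set, and both limit constants are hypothetical, and the numeric limits are placeholders rather than values from the paper or any product label.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Placeholder limits for illustration only; not taken from the paper or a label.
ILLUSTRATIVE_MAX_MG_PER_24H = 3000
ILLUSTRATIVE_MIN_INTERVAL_HOURS = 4


@dataclass
class Dose:
    taken_at: Optional[datetime]  # None models a dose whose time was not reported
    mg: int


def decide(doses, now, next_mg,
           max_mg_per_24h=ILLUSTRATIVE_MAX_MG_PER_24H,
           min_interval_hours=ILLUSTRATIVE_MIN_INTERVAL_HOURS):
    """Return 'yes', 'no', or 'uncertain' for taking next_mg at time `now`."""
    window_start = now - timedelta(hours=24)

    # Rolling window: count doses in the 24 hours before `now`,
    # not doses taken since midnight.
    recent = [d for d in doses
              if d.taken_at is not None and window_start < d.taken_at <= now]
    total = sum(d.mg for d in recent)

    if total + next_mg > max_mg_per_24h:
        return "no"
    if recent:
        last = max(d.taken_at for d in recent)
        if now - last < timedelta(hours=min_interval_hours):
            return "no"
    # Known doses pass, but an untimed dose could still fall inside the window.
    if any(d.taken_at is None for d in doses):
        return "uncertain"
    return "yes"


history = [
    Dose(datetime(2025, 1, 1, 22, 0), 1000),  # previous evening
    Dose(datetime(2025, 1, 2, 8, 0), 1000),
    Dose(datetime(2025, 1, 2, 14, 0), 1000),
]
print(decide(history, datetime(2025, 1, 2, 20, 0), 1000))
print(decide(history, datetime(2025, 1, 2, 22, 30), 1000))
print(decide([Dose(None, 500)], datetime(2025, 1, 2, 20, 0), 500))
```

In the first call, a calendar-day tally would count only the two doses taken on January 2 and allow another, while the rolling window still includes the dose from the previous evening. Returning an explicit "uncertain" for untimed doses is one possible way to model the incomplete-history cases; the benchmark's own labels may differ.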