Curation and Extraction of Drug-Related Entities… · DeepSignal

Curation and Extraction of Drug-Related Entities from Reddit Platform

arXiv cs.CL·Zewei Wang, Zihan Xu, Yishu Wei, Michael Chary, Yifan Peng

5/27/2026

Quick Take

ReDose is a dataset of 6,435 Reddit posts on substance use, annotated with DRUG, DOSE, and EFFECT entities to capture first-hand accounts of drug use that clinical overdose cases miss. BiomedBERT achieved an F1-score of 0.843 for DRUG extraction, and Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72). EFFECT extraction remains difficult: GPT-4 reached a recall of only 0.41.

Key Points

ReDose dataset includes 6,435 Reddit posts on substance use.
BiomedBERT achieved an F1-score of 0.843 for drug entity extraction.
Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72).
EFFECT extraction remains challenging, with GPT-4 reaching a recall of only 0.41.
The dataset aims to bridge the gap between clinical knowledge and user experiences.

Article Excerpt

From source RSS / original summary

arXiv:2605.26445v1. Abstract: Physicians learn primarily about illicit drugs from clinical overdose cases, limiting their understanding of real-world usage. Meanwhile, drug users share first-hand experiences online, offering insights into dosage and effects of drugs. To bridge this gap, we introduce ReDose (REddit Drug DOSe and Effect), a dataset of 6,435 Reddit posts on substance use.

A board-certified toxicologist primarily annotated both the training and test sets, while two medical science students contributed to the test set, labeling DRUG, DOSE, and EFFECT entities. We benchmarked 6,267 annotations using BERT-based, large language model (LLM)-based, and Retrieval-Augmented Generation (RAG) models. BiomedBERT achieved an F1-score of 0.843 for DRUG, while Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72). EFFECT extraction remains challenging, with GPT-4 achieving a recall of 0.41.

ReDose captures patient-curated narratives to advance medical data extraction from social media.
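
For readers unfamiliar with the reported metrics, the sketch below shows one common way to score entity extraction: exact-match precision, recall, and F1 over (label, text) pairs. This is an assumption for illustration only. The excerpt does not state which matching criterion the authors used, and the function names and the toy annotations are hypothetical, not taken from ReDose.

```python
from collections import Counter


def entity_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (label, text) entity pairs.

    Assumption: an entity counts as correct only if both its label and its
    text match a gold annotation exactly. The paper may use a different rule.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection counts each gold entity at most once.
    true_pos = sum((gold_counts & pred_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def per_label_prf(gold, predicted, label):
    """Score a single entity type, e.g. only EFFECT entities."""
    return entity_prf(
        [e for e in gold if e[0] == label],
        [e for e in predicted if e[0] == label],
    )


# Toy annotations invented for illustration (not from the dataset).
gold = [("DRUG", "kratom"), ("DOSE", "5 g"),
        ("EFFECT", "nausea"), ("EFFECT", "drowsiness")]
pred = [("DRUG", "kratom"), ("DOSE", "5 g"),
        ("EFFECT", "nausea"), ("EFFECT", "headache")]

overall = entity_prf(gold, pred)
effect_only = per_label_prf(gold, pred, "EFFECT")
print("overall P/R/F1:", overall)
print("EFFECT P/R/F1:", effect_only)
```

Under this scoring, a low recall such as the 0.41 reported for GPT-4 on EFFECT would mean that most gold EFFECT entities were not recovered, regardless of how precise the predicted ones were.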
