A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

arXiv cs.CL · Takehiro Ishikawa, Jon Duke

5/26/2026 · ~2 min read

Quick Take

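The paper audits clinical-interview depression detection benchmarks with four complementary probes. It finds that fine-grained leaderboard rankings on the E-DAIC official split are unstable, that strong in-domain baselines transfer poorly to external corpora, and that text models respond to symptom-dense interview content while audio models barely change.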

Key Points

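Under strict leave-one-subject-out cross-validation on E-DAIC, a lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723.
Development cross-validation and official-test rankings align only moderately across 96 configurations: top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps.
Text scores rise sharply on symptom-dense interview slices, while audio scores stay nearly flat.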
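
The leave-one-subject-out protocol behind the macro-F1 figure can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the single scalar feature, the midpoint-threshold learner (`fit_threshold`), and the toy rows are all hypothetical stand-ins, and the sketch assumes macro-F1 is computed once over the pooled out-of-fold predictions.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def fit_threshold(train):
    """Hypothetical toy learner: midpoint between the two class means of one feature."""
    by_label = defaultdict(list)
    for _, x, y in train:
        by_label[y].append(x)
    mean0 = sum(by_label[0]) / len(by_label[0])
    mean1 = sum(by_label[1]) / len(by_label[1])
    return (mean0 + mean1) / 2, mean1 >= mean0


def loso_predictions(rows):
    """rows: list of (subject_id, feature, label) with binary labels 0/1.

    Each subject is held out in turn, so no subject appears in both the
    training fold and the test fold. Returns pooled (y_true, y_pred).
    """
    subjects = sorted({s for s, _, _ in rows})
    y_true, y_pred = [], []
    for held_out in subjects:
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        threshold, positive_above = fit_threshold(train)
        for _, x, y in test:
            y_true.append(y)
            y_pred.append(int((x > threshold) == positive_above))
    return y_true, y_pred


# Made-up illustration data, not taken from E-DAIC.
rows = [
    ("s1", 0.2, 0), ("s2", 0.3, 0), ("s3", 0.8, 1),
    ("s4", 0.9, 1), ("s5", 0.6, 0), ("s6", 0.4, 1),
]
y_true, y_pred = loso_predictions(rows)
print(f"out-of-fold macro-F1: {macro_f1(y_true, y_pred):.3f}")
```

Pooling predictions before scoring matters here because each held-out fold may contain a single subject, for which a per-fold macro-F1 would be poorly defined.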

Article Content

From source RSS / original summary

arXiv:2605.23977v1 Announce Type: new

Abstract: This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH.

First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 (the highest reported under this protocol, to our knowledge), providing a conservative out-of-fold reference point that does not depend on the privileged official holdout.

Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps.

Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker.
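
The ranking-stability checks in the second probe (cross-ranking, top-k overlap, and rank-1 frequency under subject bootstraps) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the configuration names and scores are hypothetical, the bootstrap scores configurations by accuracy rather than macro-F1 to keep the sketch short, and a tie for first place is counted as not rank-1.

```python
import random


def ranks(scores):
    """Map configuration name -> rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def top_k_overlap(cv_scores, test_scores, k=3):
    """Number of configurations shared by the top-k of both rankings."""
    top_cv = set(sorted(cv_scores, key=cv_scores.get, reverse=True)[:k])
    top_test = set(sorted(test_scores, key=test_scores.get, reverse=True)[:k])
    return len(top_cv & top_test)


def rank1_frequency(per_subject_correct, winner, n_boot=1000, seed=0):
    """Fraction of subject bootstraps in which `winner` is the unique best.

    per_subject_correct maps configuration -> list of 0/1 correctness flags,
    one per test subject, in the same subject order for every configuration.
    Subjects are resampled with replacement; hit counts are compared as
    integers, which is equivalent to comparing accuracies.
    """
    rng = random.Random(seed)
    n = len(per_subject_correct[winner])
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        hits = {name: sum(flags[i] for i in idx)
                for name, flags in per_subject_correct.items()}
        best = max(hits.values())
        unique_best = sum(1 for v in hits.values() if v == best) == 1
        if hits[winner] == best and unique_best:
            wins += 1
    return wins / n_boot


# Hypothetical scores for three made-up configurations.
cv = {"cfg_a": 0.70, "cfg_b": 0.68, "cfg_c": 0.65}
test = {"cfg_a": 0.61, "cfg_b": 0.66, "cfg_c": 0.64}
print(ranks(test)[max(cv, key=cv.get)])  # test rank of the best CV configuration
print(top_k_overlap(cv, test, k=1))      # shared configurations in the top-1
```

A low rank-1 frequency for the apparent winner indicates that its lead depends on which subjects happen to be in the test set, which is the instability the abstract reports.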

Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.
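
One way to read the paired-slice comparison is as a per-seed difference of differences. The sketch below assumes the text-minus-audio gap is defined as the text model's dense-minus-light score change minus the audio model's; the abstract does not spell out the exact definition, so treat this as one plausible formalization. The function names and the example numbers are hypothetical and not taken from the paper.

```python
def slice_gap(text_dense, text_light, audio_dense, audio_light):
    """Text-minus-audio gap in the dense-minus-light score change (assumed definition)."""
    text_delta = text_dense - text_light
    audio_delta = audio_dense - audio_light
    return text_delta - audio_delta


def gaps_by_seed(scores):
    """scores maps seed -> dict with the four slice scores for that seed."""
    return {seed: slice_gap(**s) for seed, s in scores.items()}


def all_positive(gaps):
    """True when the gap is positive for every seed."""
    return all(g > 0 for g in gaps.values())


# Made-up illustration values for two seeds.
example = {
    0: {"text_dense": 0.80, "text_light": 0.50,
        "audio_dense": 0.52, "audio_light": 0.50},
    1: {"text_dense": 0.75, "text_light": 0.55,
        "audio_dense": 0.49, "audio_light": 0.50},
}
gaps = gaps_by_seed(example)
print(gaps, all_positive(gaps))
```

Because the dense and light slices come from the same interviews, comparing the deltas within each seed controls for overall model quality and isolates sensitivity to symptom content.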


