The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models
Quick Answer
The study probes the 'dark regulome' in gliomas with three genomic foundation models (Caduceus-Ph, HyenaDNA, and Enformer) and finds a sharp 10kb proximal-regulatory horizon that survives every control applied.
Quick Take
Using in-silico mutagenesis, the study scores 30,448 dark genome elements at 92 glioma-relevant loci with Caduceus-Ph, HyenaDNA, and Enformer. It introduces a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven variance, and reports that the top-100 elements of each model are 3.3x enriched for matching brain eQTLs.
Key Points
- Introduced a diagnostic tool to separate predictability-driven from regulation-driven variance.
- Top-100 elements are 3.3x enriched per model for matching brain eQTLs (empirical p < 5e-3).
- A six-feature linear baseline matches Caduceus top-decile membership at AUC 0.985, so the LM-derived element-class hierarchy does not survive the controls.
- Residualization-and-permutation diagnostic applied to 30,448 dark genome elements at 92 glioma-relevant loci.
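- Cross-architecture decomposition separates a sequence-predictability layer (the two language models) from a regulatory-output layer (Enformer alone), with zero overlap between the two top-100 lists.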
Article Content
From source RSS / original summary (arXiv:2606.06834v1). Abstract: High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells.
The regulatory program written across the dark genome, what we call the $\textit{dark regulome}$, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined.
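As a rough illustration of what in-silico mutagenesis involves, the sketch below substitutes every base of a sequence and records how a scoring function changes. The scorer `toy_gc_score` is a hypothetical stand-in for a foundation model's likelihood or predicted output. The final aggregation into one per-element number is also an assumption, since the abstract does not define how its scores are computed.

```python
import numpy as np

ALPHABET = "ACGT"

def ism_deltas(seq, score_fn):
    """Return a (len(seq), 4) array of score changes for every single-base substitution."""
    ref = score_fn(seq)
    deltas = np.zeros((len(seq), len(ALPHABET)))
    for i, ref_base in enumerate(seq):
        for j, alt in enumerate(ALPHABET):
            if alt == ref_base:
                continue  # reference base: delta stays 0
            mutant = seq[:i] + alt + seq[i + 1:]
            deltas[i, j] = score_fn(mutant) - ref
    return deltas

def toy_gc_score(seq):
    # Hypothetical scorer (GC count); a real study would call a model here.
    return float(seq.count("G") + seq.count("C"))

deltas = ism_deltas("ACGT", toy_gc_score)

# Assumed aggregation: mean over positions of the largest absolute effect.
element_score = np.abs(deltas).max(axis=1).mean()
```

With a likelihood-based scorer, positions the model predicts confidently tend to yield large deltas whether or not they are regulatory, which is the coupling the abstract describes.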
Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$.
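One way to picture a residualization-and-permutation check is sketched below on synthetic data. It is not the authors' implementation. The single covariate `length`, the class label `is_te`, and both helper functions are hypothetical, and ordinary least squares stands in for whatever regression the paper uses. The idea is to regress out sequence-level features first, then ask whether a class difference in the residuals still beats a label-shuffled null.

```python
import numpy as np

def residualize(scores, covariates):
    """Remove the part of scores explained linearly by the covariates (OLS with intercept)."""
    X = np.column_stack([np.ones(len(scores)), covariates])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return scores - X @ beta

def permutation_pvalue(values, labels, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in class means."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    observed = values[labels].mean() - values[~labels].mean()
    null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(labels)  # shuffle class labels
        null[k] = values[perm].mean() - values[~perm].mean()
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, p

# Synthetic example: scores depend only on length, and the class tracks length.
rng = np.random.default_rng(1)
length = rng.normal(size=500)
is_te = length > 0
scores = 2.0 * length + 0.1 * rng.normal(size=500)

raw_diff, raw_p = permutation_pvalue(scores, is_te)
resid = residualize(scores, length)
resid_diff, resid_p = permutation_pvalue(resid, is_te)
```

In this toy setup the raw class difference looks highly significant, yet it shrinks to noise once length is regressed out. That is the kind of predictability-driven signal such a diagnostic is meant to expose.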
Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists.
Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.
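A fold enrichment with an empirical p-value of this kind can be sketched as follows. All data here are synthetic, and `enrichment_pvalue` is a hypothetical helper. The null draws uniformly random element sets; a stricter null, matched on element properties, may be what the paper's permutation tests use.

```python
import numpy as np

def enrichment_pvalue(is_hit, top_idx, n_perm=5000, seed=0):
    """Fold enrichment of hits in top_idx and a one-sided empirical p-value."""
    rng = np.random.default_rng(seed)
    is_hit = np.asarray(is_hit, dtype=bool)
    k = len(top_idx)
    observed = is_hit[top_idx].sum()
    expected = k * is_hit.mean()  # hits expected in a random set of size k
    null = np.array([
        is_hit[rng.choice(len(is_hit), size=k, replace=False)].sum()
        for _ in range(n_perm)
    ])
    fold = observed / expected
    # Add-one correction keeps the empirical p-value strictly positive.
    p_emp = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return fold, p_emp

# Synthetic example: 1000 elements, 10% overlap an eQTL, 30 of them in the top-100.
n_elements = 1000
is_eqtl = np.zeros(n_elements, dtype=bool)
is_eqtl[:30] = True       # hits inside the toy top-100
is_eqtl[100:170] = True   # remaining hits elsewhere
top100 = np.arange(100)

fold, p_emp = enrichment_pvalue(is_eqtl, top100)
```

With `n_perm` permutations the smallest reportable p-value is `1 / (n_perm + 1)`, so the number of permutations sets the resolution of any threshold such as $5\times 10^{-3}$.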