GENIE: A Fine-Grained Measure for Novelty
Quick Answer
The paper introduces GENIE, a fine-grained metric for assessing the novelty of model-generated content, addressing the shortcomings of holistic metrics.
Quick Take
The paper shows that GENIE captures task-specific features of novelty, offering insight into model creativity and into the impact of mitigation methods.
Key Points
- GENIE measures novelty in model outputs with task-specific features.
- Holistic metrics fail to capture the complexity of novelty effectively.
- The study evaluates the effectiveness of methods aimed at enhancing creativity.
- GENIE provides insights into which properties contribute to novelty.
Article Excerpt
From source RSS / original summary (arXiv:2606.12790v1). Abstract: Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on addressing whether models are capable of generating creative outputs. Here, we aim to consider novelty and investigate what makes model-generated content novel or not novel in a task-specific manner.
We propose a fine-grained evaluation metric GENIE to measure the novelty of responses along task-specific features with respect to a population of responses. We show that unlike GENIE, holistic metrics struggle to capture the high-dimensionality of novelty and do not provide insight on which properties they target. Finally, we use GENIE to measure the effectiveness of mitigation methods that address creativity to better understand where these methods can improve novelty.
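To make the contrast between fine-grained and holistic scoring concrete, here is a minimal Python sketch. It is not the paper's definition of GENIE, which the abstract does not give. It assumes each response has already been reduced to categorical task-specific features, and it assumes novelty for a feature can be read as the rarity of its value within the population. The feature names (`setting`, `protagonist`), the functions `per_feature_novelty` and `holistic_novelty`, and the rarity formula are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, List

# A response reduced to categorical, task-specific features (hypothetical schema).
Features = Dict[str, str]


def per_feature_novelty(response: Features, population: List[Features]) -> Dict[str, float]:
    """Score each feature of `response` by how rare its value is in `population`.

    Assumed formula (not from the paper):
    novelty = 1 - (share of population responses that have the same value).
    """
    if not population:
        raise ValueError("population must be non-empty")
    scores: Dict[str, float] = {}
    for name, value in response.items():
        # Count how often each value of this feature appears in the population.
        counts = Counter(p.get(name) for p in population)
        scores[name] = 1.0 - counts[value] / len(population)
    return scores


def holistic_novelty(scores: Dict[str, float]) -> float:
    """Collapse per-feature scores into one number (a plain mean, as an assumption)."""
    return sum(scores.values()) / len(scores)


# Toy population of story responses, invented for this example.
population = [
    {"setting": "forest", "protagonist": "knight"},
    {"setting": "forest", "protagonist": "wizard"},
    {"setting": "forest", "protagonist": "knight"},
    {"setting": "city", "protagonist": "knight"},
]
response = {"setting": "forest", "protagonist": "robot"}

scores = per_feature_novelty(response, population)
print(scores)                    # {'setting': 0.25, 'protagonist': 1.0}
print(holistic_novelty(scores))  # 0.625
```

In this toy case the single averaged score of 0.625 hides the fact that the response is novel in its protagonist but conventional in its setting, which is the kind of information a per-feature breakdown keeps visible.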