
New framework for auditing machine unlearning
Quick Answer
Google Research introduces a novel framework for auditing machine unlearning, addressing the need for accountability in AI systems.
Quick Take
This framework enables the verification of unlearning processes in various machine learning models, ensuring compliance with data privacy regulations. It emphasizes the importance of reliable unlearning methods to enhance user trust and data protection.
Key Points
- New framework enhances accountability in AI systems through effective auditing of unlearning processes.
- Focuses on verifying unlearning in various machine learning models for compliance with privacy regulations.
- Aims to boost user trust by ensuring reliable data unlearning methods.
- Addresses the growing need for transparency in AI data management practices.
Machine unlearning allows AI systems to "forget" specific parts of their training data without the massive cost of retraining a model from scratch. This is essential for regulatory compliance (like GDPR’s "Right to be Forgotten"), AI safety, and model quality.
As models process increasingly massive and highly sensitive datasets, verifying machine unlearning has moved from a theoretical ideal to a strict requirement: developers must now mathematically prove privacy. However, because auditors often don’t have access to the model's internal workings or original training data, they must verify the system strictly by querying it and analyzing the output samples.
One method data scientists and researchers rely on for verification is two-sample testing, a statistical method that tests whether two sets of observations come from the same underlying distribution. For example, to verify unlearning, auditors might compare outputs from a model that never saw a specific record against a model that supposedly "forgot" it. If the outputs differ by more than a defined significance threshold allows, the unlearning failed.
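To make the idea concrete, here is a minimal sketch of a permutation two-sample test. It is illustrative only and is not the test proposed in this post: the statistic (a gap between sample means), the function names, and the synthetic Gaussian "model outputs" are all assumptions chosen for brevity.

```python
import random

def mean_gap(xs, ys):
    """Test statistic: absolute difference between the two sample means."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

def permutation_test(xs, ys, num_permutations=500, seed=0):
    """Return a p-value for H0: xs and ys come from the same distribution."""
    rng = random.Random(seed)
    observed = mean_gap(xs, ys)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(num_permutations):
        # Under H0 the labels are exchangeable, so reshuffle and re-split.
        rng.shuffle(pooled)
        if mean_gap(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic itself among the permutations.
    return (exceed + 1) / (num_permutations + 1)

# Hypothetical stand-ins for scalar model outputs (assumed, not real data).
data_rng = random.Random(1)
retrained = [data_rng.gauss(0.0, 1.0) for _ in range(200)]      # never saw the record
unlearned_ok = [data_rng.gauss(0.0, 1.0) for _ in range(200)]   # same distribution
unlearned_bad = [data_rng.gauss(1.0, 1.0) for _ in range(200)]  # shifted outputs

p_same = permutation_test(retrained, unlearned_ok)
p_shift = permutation_test(retrained, unlearned_bad)
print(f"p-value, matching outputs: {p_same:.3f}")
print(f"p-value, shifted outputs:  {p_shift:.3f}")
```

A small p-value for the shifted sample means the auditor would reject the hypothesis that both models produce the same output distribution. A mean-gap statistic only sees shifts in the mean, which is one reason richer statistics are needed in practice.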
As models grow in size and complexity, two-sample testing and other statistical tools used for machine unlearning auditing become challenging to implement, and they lose statistical power. To distinguish a real violation from the random noise inherent in large-scale models with enough statistical significance, an auditor needs to extract a large number of samples. This makes real-world testing computationally very expensive.
To address this growing challenge, we introduce Regularized f-Divergence Kernel Tests, presented at AISTATS 2026, a new framework designed to make auditing ML models much more sensitive, flexible, and accurate. We theoretically prove that our tests naturally control for false positives for any sample size, and that the risk of false negatives reliably converges to zero as the number of available data samples increases.
The challenge: Why standard tools fall short
Evaluating model safety often requires measuring the distance, or divergence, between two complex data sets. Different applications naturally require different notions of “distance”. While popular standard tools like maximum mean discrepancy (MMD) excel at detecting broad, global shifts across data (such as a model systematically generating brighter images than its counterpart), they often lack the necessary specificity to capture complex anomalies. For instance, if the addition of a specific person's data causes a model to generate a highly specific outlier output only when prompted in a very exact way — while having an equal distribution on all other samples — traditional MMD tests might completely overlook this local shift.
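For readers unfamiliar with MMD, the following sketch computes a simple (biased) estimate of the squared MMD with a Gaussian kernel on one-dimensional samples. The bandwidth of 1.0, the sample sizes, and the synthetic data are assumptions for illustration; this is the standard baseline statistic discussed above, not the regularized f-divergence test.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two 1-D samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

# Synthetic samples (assumed): a reference, a matching sample, a global shift.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(100)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(100)]
shifted = [rng.gauss(2.0, 1.0) for _ in range(100)]

mmd_same = mmd_squared(reference, same_dist)
mmd_shifted = mmd_squared(reference, shifted)
print(f"MMD^2, same distribution: {mmd_same:.4f}")
print(f"MMD^2, global mean shift: {mmd_shifted:.4f}")
```

The estimate is clearly larger for the globally shifted sample. Because the statistic averages kernel values over all pairs, a change confined to a small region of the output space contributes little to it, which is the weakness described above.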
Also, most existing testing frameworks force researchers to make error-prone manual choices, such as picking the specific statistic best suited for either global or local shifts or tuning complex settings like kernel bandwidths and regularization parameters.
In addition to being hard in practice, two-sample testing as a verification method is flawed when verifying unlearning of ML models. Consider the example below showing how two models trained from scratch on the exact same data can produce different distributions. The blue distribution is the distribution of a model retrained without compromised data. However, its distribution is different from the standard (green) due to retraining with different batch sizes. This results in a false positive, indicating that the tested model is unsafe.
Furthermore, recent work shows that an AI model can never perfectly “forget” data just by tweaking its current settings; unless it re-traces every step of its original training, it will always leave behind a permanent footprint of the information it was supposed to delete. Accordingly, achieving perfect “retrain equivalence” is fundamentally impossible for standard, local unlearning algorithms and a traditional two-sample test can always find a dependence on the “forget set”.
The framework
We resolve this challenge by proposing a relative distance test that measures whether an unlearned model is distributionally closer to a safely retrained model or to the original, compromised one.
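The relative comparison can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' statistic: it swaps in the Kolmogorov-Smirnov distance for the regularized f-divergence estimates, uses a bootstrap only as an informal stability check rather than a calibrated test, and all names and synthetic samples are hypothetical.

```python
import bisect
import random

def ks_distance(xs, ys):
    """Largest gap between the two empirical CDFs (Kolmogorov-Smirnov)."""
    sx, sy = sorted(xs), sorted(ys)
    gap = 0.0
    for v in sx + sy:
        cdf_x = bisect.bisect_right(sx, v) / len(sx)
        cdf_y = bisect.bisect_right(sy, v) / len(sy)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap

def relative_gap(unlearned, retrained, original):
    """Negative when `unlearned` is closer to `retrained` than to `original`."""
    return ks_distance(unlearned, retrained) - ks_distance(unlearned, original)

def closer_fraction(unlearned, retrained, original, rounds=200, seed=0):
    """Share of bootstrap resamples in which `unlearned` sits closer to `retrained`."""
    rng = random.Random(seed)

    def resample(xs):
        return [rng.choice(xs) for _ in xs]

    closer = 0
    for _ in range(rounds):
        gap = relative_gap(resample(unlearned), resample(retrained), resample(original))
        if gap < 0:
            closer += 1
    return closer / rounds

# Synthetic scalar "model outputs" (assumed for illustration).
rng = random.Random(2)
original = [rng.gauss(2.0, 1.0) for _ in range(150)]        # memorized the data
retrained = [rng.gauss(0.0, 1.0) for _ in range(150)]       # gold standard
unlearned_good = [rng.gauss(0.2, 1.0) for _ in range(150)]  # near the gold standard
unlearned_bad = [rng.gauss(1.8, 1.0) for _ in range(150)]   # near the original

gap_good = relative_gap(unlearned_good, retrained, original)
gap_bad = relative_gap(unlearned_bad, retrained, original)
fraction_good = closer_fraction(unlearned_good, retrained, original)
print(f"relative gap (good unlearning): {gap_good:+.3f}")
print(f"relative gap (bad unlearning):  {gap_bad:+.3f}")
print(f"bootstrap share closer to retrained: {fraction_good:.2f}")
```

The point of the design is that `unlearned_good` does not need to match the retrained model exactly; it only needs to be closer to it than to the original model, so small benign differences between training runs do not trigger a failure.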
Our test acts as a highly adaptable statistical toolkit that leverages f-divergences to allow auditors to pinpoint highly specific types of data shifts, including:
- Chi-squared and Kullback-Leibler (KL) divergences: These are highly effective for identifying smooth and localized differences in data, such as outliers in physical models.
- Hockey-stick divergence: Specifically suited to capturing definitions of privacy and unlearning, this divergence operates with a parameter that controls the degree of statistical indistinguishability. It effectively establishes an acceptable threshold, ignoring minor differences below a safety budget and only triggering an alert when a meaningful privacy breach occurs.
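As a small worked example, the three divergences listed above can be computed exactly for discrete distributions using their textbook definitions. The two toy distributions `p` and `q` and the function names are assumptions for illustration; the framework in this post estimates these quantities from samples with kernel regularization, which this sketch does not attempt.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), assuming q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi_squared_divergence(p, q):
    """Chi-squared(P || Q) = sum_i (p_i - q_i)^2 / q_i, assuming q_i > 0."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def hockey_stick_divergence(p, q, gamma):
    """E_gamma(P || Q) = sum_i max(p_i - gamma * q_i, 0), for gamma >= 1."""
    return sum(max(pi - gamma * qi, 0.0) for pi, qi in zip(p, q))

# Toy output distributions over three outcomes (assumed for illustration).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl = kl_divergence(p, q)
chi2 = chi_squared_divergence(p, q)
hs_strict = hockey_stick_divergence(p, q, gamma=1.0)
hs_budget = hockey_stick_divergence(p, q, gamma=1.5)
print(f"KL: {kl:.4f}  chi-squared: {chi2:.4f}")
print(f"hockey-stick, gamma=1.0: {hs_strict:.4f}")
print(f"hockey-stick, gamma=1.5: {hs_budget:.4f}")
```

With `gamma = 1.0` the hockey-stick divergence equals the total variation distance (0.1 for these distributions), while at `gamma = 1.5` it is zero because no outcome is more than 1.5 times as likely under `p` as under `q`. This is the thresholding behavior described above; in differential privacy the parameter is typically set to `gamma = exp(epsilon)`.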
Calculating these divergences on high-dimensional, real-world data is notoriously difficult. To make these complex optimization problems tractable without requiring massive amounts of compute, we use kernel regularization methods to estimate the differences efficiently.
Our adaptive testing approach automatically selects the best divergence and the optimal hyperparameter configurations to maximize the reliability of the test, entirely eliminating the need for sample splitting.
Experiments
Because our proposed tests are general, we experimented across a wide variety of problems. We evaluated our framework on perturbed uniforms (synthetic two-sample benchmarks), as well as the Expo1D outlier detection task within physics datasets, a specialized area that uses ML to search for new physical phenomena outside the Standard Model of particle physics. We used high-energy physics data because that field requires the world’s most precise “difference detectors”. The idea is that if the framework can spot a rare particle that defies the laws of physics, it can spot a tiny privacy leak in an AI model.
We then shifted our primary focus to the critical, real-world applications of auditing differential privacy and evaluating machine unlearning:
- Privacy auditing: Differential privacy provides a framework for protecting user data by introducing calibrated noise, bounding the influence of any single individual. We tested multiple non-private mechanisms by sampling their outputs across two simulated datasets that differed by only one record. If a mechanism is truly private, the two resulting samples must be indistinguishable; if it is flawed, the test should flag the privacy violation.
- Machine unlearning evaluation: Instead of relying on the flawed approach of simply comparing a gold standard model (one retrained from scratch without the forgotten data) to the unlearned model, we leveraged a three-sample relative test, applying it to various established unlearning algorithms, including Selective Synaptic Dampening, pruning, and random label techniques. Our test evaluated whether the unlearned model distribution was closer to the safe gold standard model, or closer to the original, fully trained model that actively memorized the sensitive data.
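The privacy-auditing setup in the first item above can be sketched with a toy mechanism. Everything here is an assumption for illustration: randomized response stands in for the mechanisms studied in the experiments, the two neighboring datasets are reduced to a single bit, and a plug-in frequency estimate of the hockey-stick divergence replaces the kernel-based test. Sampling noise means a real audit needs a calibrated threshold, which this sketch omits.

```python
import math
import random

def randomized_response(bit, truth_prob, rng):
    """Report the true bit with probability truth_prob, otherwise its flip."""
    return bit if rng.random() < truth_prob else 1 - bit

def output_frequencies(bit, truth_prob, n, rng):
    """Empirical distribution [P(output=0), P(output=1)] over n mechanism runs."""
    ones = sum(randomized_response(bit, truth_prob, rng) for _ in range(n))
    return [1.0 - ones / n, ones / n]

def hockey_stick(p, q, gamma):
    """Plug-in hockey-stick divergence between two discrete distributions."""
    return sum(max(pi - gamma * qi, 0.0) for pi, qi in zip(p, q))

def audit(truth_prob, epsilon, n=20000, seed=0):
    """Estimate the worst-direction hockey-stick divergence at gamma = exp(epsilon)."""
    rng = random.Random(seed)
    with_record = output_frequencies(1, truth_prob, n, rng)     # dataset A
    without_record = output_frequencies(0, truth_prob, n, rng)  # neighboring dataset B
    gamma = math.exp(epsilon)
    return max(hockey_stick(with_record, without_record, gamma),
               hockey_stick(without_record, with_record, gamma))

epsilon = 1.0
# Randomized response satisfies epsilon-DP when truth_prob = e^eps / (1 + e^eps).
private_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))

honest = audit(private_prob, epsilon)
leaky = audit(0.95, epsilon)  # tells the truth too often for the claimed epsilon
print(f"estimated divergence, correctly calibrated: {honest:.4f}")
print(f"estimated divergence, leaky mechanism:      {leaky:.4f}")
```

The correctly calibrated mechanism yields an estimate near zero, while the leaky one yields a large value, signaling that its outputs on the two neighboring datasets are more distinguishable than the claimed privacy budget allows.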
Results
Our framework matched or outperformed all previous baseline methods with significantly less manual tuning.
The experimental results demonstrated that no single test consistently outperforms the others across every possible scenario. Instead, different f-divergences act as specialized sensors that "light up" for different types of localized data shifts. By using an aggregated approach across diverse statistics, our framework successfully caught subtle errors and anomalies that standard tests completely missed.
For privacy auditing, the hockey-stick divergence test proved to be a powerful and effective tool. Because it directly aligns with the mathematical foundations of pure differential privacy, it allows auditors to tightly control the acceptable degree of data shift. Our adaptive testing framework successfully caught privacy violations using significantly fewer data samples and requiring far less hyperparameter tuning than previous baseline testers.
In one notable instance, our framework detected violations in a specific sparse vector technique mechanism (SVT3) using only a few thousand samples, while previously studied techniques like DP-Auditorium required millions of samples to approximate the same violation detection rate.
Our findings also suggest a redefinition of how to evaluate machine unlearning. As shown in the table below, we observed that none of the approximate unlearning methods we evaluated were compliant with the strict, standard two-sample unlearning definition. Because two-sample tests simply look for any distributional difference, they incorrectly flagged perfectly safe, retrained models as unlearning failures.
In contrast, our proposed relative three-sample test successfully overcame this flaw. It correctly and consistently identified the safely retrained models as "safe". When evaluating the approximate unlearning algorithms, only the random label technique passed the evaluation.
Other popular methods, such as finetuning, pruning, and Selective Synaptic Dampening, were found to be ineffective at truly forgetting the targeted data. We emphasize that our primary goal in these experiments was the evaluation of the unlearning methodologies, rather than designing the algorithms themselves. Consequently, we used simplified implementations of these unlearning procedures; more rigorous setups will be required to rank unlearning methods in practical production environments.
Conclusion
Our newly proposed framework provides a much more precise, adaptable, and mathematically sound lens for examining ML behavior. By leveraging regularized f-divergence kernel tests, researchers and auditors can now statistically test whether a model is behaving unsafely or leaking data across a massive class of problems and complex distributional shifts.
As this field evolves, theoretically grounding our empirical observations to characterize exactly which specific divergence is optimal for other novel tasks remains an exciting direction for future work. Establishing tighter sample complexity bounds will also be a key focus to make these audits even more efficient.
Acknowledgements
The work described here was done jointly with Antonin Schrab and Arthur Gretton. We thank Nicole Mitchell and Eleni Triantafillou for insightful feedback, Kimberly Schwede for the graphics, and Mark Simborg for helpful edits.
— Originally published at research.google
