SlideCheck: Guiding Self-Supervised Pretraining of Pathology Foundation Models via Dataset Distributions
Quick Answer
SlideCheck is a novel tool that enhances the pretraining of pathology foundation models by providing explicit abnormality and malignancy scores for patch selection.
Quick Take
SlideCheck uses a dual-head MLP to score patches, giving finer control over the quality and composition of pretraining datasets. Curated subsets built from these scores can approach full-data performance, which suggests a route to more efficient self-supervised ViT pretraining.
Key Points
- SlideCheck uses a dual-head MLP to model abnormal morphology and malignancy evidence.
- It provides explicit scores for organizing, filtering, and auditing pathology pretraining data.
- Curated subsets defined by SlideCheck can approach full-data performance.
- The data distributions it defines influence the downstream behavior of self-supervised ViT pretraining.
- It transforms large patch pools into controllable and reusable pretraining datasets.
Article Content
From source RSS / original summary
arXiv:2606.07590v1
Abstract: Pathology foundation models are pretrained on large streams of WSI-derived patches, while supervision during data construction is often slide-level, sparse, or heterogeneous. This mismatch makes it difficult to understand and control which biological patterns enter the pretraining data. We propose SlideCheck, a lightweight pretraining data guidance tool built on frozen pathology foundation model patch features.
Rather than serving as a standalone patch diagnostic model, SlideCheck provides explicit abnormality and malignancy scores for organizing, filtering, and auditing pathology pretraining data. SlideCheck uses a dual-head MLP to separately model broad abnormal morphology and malignant evidence. A regularized feature-space scorer provides a supervised anchor for patch-level evidence estimation, while score-attention agreement combines patch scores with WSI-level MIL attention to mine high-confidence pseudo labels.
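As a rough illustration of the scoring and pseudo-label mining described above, the sketch below runs a dual-head MLP over precomputed patch features and then keeps only patches where a high score coincides with high MIL attention. The abstract gives no architectural or algorithmic details, so everything specific here is an assumption: the single shared ReLU layer, the sigmoid heads, the random weights, and the agreement rule (score above `score_thr` and attention in the top `top_frac` of the slide). The function names are hypothetical, and the regularized feature-space scorer is not modeled.

```python
import numpy as np


def init_params(feature_dim, hidden_dim, seed=0):
    """Random weights for illustration only; a real scorer would be trained."""
    rng = np.random.default_rng(seed)
    return {
        "w_shared": rng.normal(0.0, 0.1, (feature_dim, hidden_dim)),
        "b_shared": np.zeros(hidden_dim),
        "w_abn": rng.normal(0.0, 0.1, hidden_dim),  # abnormality head
        "b_abn": 0.0,
        "w_mal": rng.normal(0.0, 0.1, hidden_dim),  # malignancy head
        "b_mal": 0.0,
    }


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dual_head_scores(features, params):
    """Map frozen patch features (n_patches, feature_dim) to two scores per patch."""
    # Shared hidden layer with ReLU (assumed architecture).
    hidden = np.maximum(0.0, features @ params["w_shared"] + params["b_shared"])
    # Two separate heads, each producing one score in (0, 1) per patch.
    abnormality = _sigmoid(hidden @ params["w_abn"] + params["b_abn"])
    malignancy = _sigmoid(hidden @ params["w_mal"] + params["b_mal"])
    return abnormality, malignancy


def mine_pseudo_labels(scores, attention, score_thr=0.9, top_frac=0.1):
    """Indices of patches whose score and slide-level attention agree.

    Assumed agreement rule: score >= score_thr AND attention within the
    top `top_frac` fraction of patches on the slide.
    """
    k = max(1, int(np.ceil(top_frac * len(attention))))
    top_attention = np.zeros(len(attention), dtype=bool)
    top_attention[np.argsort(attention)[-k:]] = True
    return np.flatnonzero((scores >= score_thr) & top_attention)
```

In use, `features` would come from a frozen pathology foundation model and `attention` from a WSI-level MIL model; both are treated here as given arrays.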
The same scores are then used to construct broad-positive ViT pretraining subsets, where a patch is selected if either abnormality or malignancy evidence exceeds a threshold. Experiments show that SlideCheck-defined data distributions influence the downstream behavior of self-supervised ViT pretraining, indicating that biological composition is an important controllable factor in pathology foundation model development.
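The broad-positive selection rule can be sketched in a few lines. This is a minimal example that assumes a single shared threshold for both scores; the value 0.5, the function name `select_broad_positive`, and the toy scores are hypothetical and do not come from the paper.

```python
def select_broad_positive(abnormality, malignancy, threshold=0.5):
    """Return indices of patches where either score exceeds the threshold."""
    if len(abnormality) != len(malignancy):
        raise ValueError("score lists must have the same length")
    return [
        i
        for i, (a, m) in enumerate(zip(abnormality, malignancy))
        if a > threshold or m > threshold  # logical OR: "broad-positive"
    ]


# Toy scores for four patches (made up for illustration).
abn = [0.10, 0.80, 0.30, 0.05]
mal = [0.05, 0.20, 0.70, 0.10]

selected = select_broad_positive(abn, mal, threshold=0.5)
print(selected)                  # [1, 2]
print(len(selected) / len(abn))  # 0.5, the fraction of the pool retained
```

Because the rule is an OR over the two heads, a patch with abnormal but non-malignant morphology is still retained, which is what makes the subset "broad". Logging the retained fraction alongside the threshold is one simple way to keep the resulting subset auditable.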
Curated subsets can approach full-data performance, suggesting that explicitly scored patch pools may support more efficient and auditable pretraining data construction. These findings position SlideCheck as a data guidance and auditing layer for transforming large, undifferentiated patch pools into controllable and reusable pretraining datasets.