Benchmarking Large Language Models for Safety Data Extraction
Quick Answer
This study benchmarks Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B on Safety Data Sheet (SDS) extraction. Gemini 1.5 Pro with a chain-of-thought prompt achieved the highest accuracy, 84%.
Quick Take
Gemini 1.5 Pro led the benchmark, ahead of GPT-4o (81%) and Claude 3.7 Sonnet (79%), but no model reached the 90% accuracy threshold commonly required for reliable real-world deployment. General-purpose LLMs are therefore not yet robust enough for unsupervised industrial use and still need task-specific fine-tuning and domain adaptation.
Key Points
- Gemini 1.5 Pro outperformed the other models, reaching 84% accuracy in SDS extraction.
- Text-based extraction methods consistently surpassed multimodal approaches.
- No model achieved the 90% accuracy threshold for reliable industrial deployment.
- Future research should focus on domain-specific training and model calibration.
- Human-in-the-Loop verification is essential for safety-critical applications.
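The accuracy figures and the 90% threshold above can be made concrete with a small scoring sketch. It is illustrative only: the abstract does not say how a field is judged correct, so exact match after whitespace and case normalization is an assumption, and the field names in the usage note are hypothetical. The three accuracies are the ones quoted in the abstract; no figure is given there for Llama 3.1-70B, so it is left out.

```python
DEPLOYMENT_THRESHOLD = 0.90  # accuracy threshold cited in the abstract

# Accuracies quoted in the abstract. Llama 3.1-70B's figure is not reported there.
reported_accuracy = {
    "Gemini 1.5 Pro": 0.84,
    "GPT-4o": 0.81,
    "Claude 3.7 Sonnet": 0.79,
}


def _normalize(value) -> str:
    """Collapse whitespace and ignore case (an assumed matching rule)."""
    return " ".join(str(value).split()).casefold()


def field_accuracy(predicted: dict, reference: dict) -> float:
    """Fraction of reference fields whose extracted value matches the reference.

    A field missing from `predicted` counts as incorrect.
    """
    if not reference:
        raise ValueError("reference must contain at least one field")
    correct = sum(
        1
        for name, expected in reference.items()
        if name in predicted and _normalize(predicted[name]) == _normalize(expected)
    )
    return correct / len(reference)


def meets_threshold(accuracy: float, threshold: float = DEPLOYMENT_THRESHOLD) -> bool:
    """True if the accuracy reaches the deployment threshold."""
    return accuracy >= threshold
```

For example, scoring a prediction with hypothetical fields `cas_number` and `signal_word` against a three-field reference that also contains `flash_point` gives 2/3, since the missing field counts as wrong. Applied to the reported figures, `meets_threshold` is false for every model, which is why the summary treats human review as necessary rather than optional.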
Article Content
From source RSS / original summary
arXiv:2606.11204v1 Announce Type: new
Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines. We systematically evaluate four models: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B, across three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assessed accuracy, latency, and cost across more than 50,000 extracted data fields. Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro combined with a Chain-of-Thought prompt achieved the highest accuracy (84%), outperforming GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment. These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, though performance suggests strong potential with task-specific fine-tuning. Future research should focus on domain-adapted training, model calibration, and the integration of Human-in-the-Loop verification to ensure safety-critical reliability.
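To illustrate how the three prompting strategies named in the abstract differ, here is a minimal sketch of a prompt builder. The prompt wording, the `FIELDS` list, and the `build_prompt` helper are all hypothetical; the abstract does not publish the prompts the authors used, and no model API is called here.

```python
from typing import Sequence, Tuple

# Hypothetical target fields; the paper's actual schema is not given in the abstract.
FIELDS = ("product_name", "cas_number", "signal_word")


def build_prompt(
    sds_text: str,
    strategy: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble an extraction prompt for one of three strategies.

    zero-shot: task instruction only.
    few-shot: task instruction plus worked (document, answer) examples.
    chain-of-thought: task instruction plus a request to reason before answering.
    """
    parts = [
        "Extract the following fields from the Safety Data Sheet and "
        "return them as a JSON object: " + ", ".join(FIELDS) + "."
    ]
    if strategy == "few-shot":
        if not examples:
            raise ValueError("few-shot prompting needs at least one worked example")
        for example_text, example_answer in examples:
            parts.append(f"Example SDS:\n{example_text}\nExample output:\n{example_answer}")
    elif strategy == "chain-of-thought":
        parts.append(
            "First reason step by step about where each field appears "
            "in the document, then give the final JSON."
        )
    elif strategy != "zero-shot":
        raise ValueError(f"unknown strategy: {strategy}")
    # The document to process always comes last.
    parts.append(f"SDS:\n{sds_text}")
    return "\n\n".join(parts)
```

Calling `build_prompt(text, "chain-of-thought")` yields the instruction, the reasoning request, and then the document. In a text-based pipeline of the kind the abstract compares against multimodal processing, `sds_text` would be the text already extracted from the SDS file.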