Benchmarking Large Language Models for Safety Data Extraction

arXiv cs.CL·Jonas Grill, Thomas Bayer, Sören Berlinger

6/11/2026

Quick Answer

This study benchmarks Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B on Safety Data Sheet (SDS) extraction. Gemini 1.5 Pro achieved the highest accuracy, at 84%.

Quick Take

Despite strong performance, no model met the 90% accuracy threshold necessary for reliable industrial use, indicating a need for further fine-tuning and domain adaptation.

Key Points

Gemini 1.5 Pro outperformed the other three models, reaching 84% accuracy in SDS extraction.
Text-based extraction methods consistently surpassed multimodal approaches.
No model achieved the 90% accuracy threshold for reliable industrial deployment.
Future research should focus on domain-specific training and model calibration.
Human-in-the-Loop verification is essential for safety-critical applications.
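
The threshold and human-in-the-loop points above can be illustrated with a minimal sketch. It assumes accuracy is measured as the share of reference fields that a model extracts correctly after light normalization; the summary does not say how the paper computes its metric, so `field_accuracy`, `needs_human_review`, the field names, and the sample values are all hypothetical.

```python
from typing import Dict

# 90% is the deployment threshold quoted in the summary; everything else
# in this sketch is an illustrative assumption, not the paper's method.
ACCURACY_THRESHOLD = 0.90


def _normalize(value: str) -> str:
    """Collapse whitespace and ignore case so trivial differences do not count as errors."""
    return " ".join(value.split()).casefold()


def field_accuracy(predicted: Dict[str, str], reference: Dict[str, str]) -> float:
    """Return the fraction of reference fields whose predicted value matches."""
    if not reference:
        raise ValueError("reference must contain at least one field")
    correct = sum(
        1
        for key, expected in reference.items()
        if key in predicted and _normalize(predicted[key]) == _normalize(expected)
    )
    return correct / len(reference)


def needs_human_review(accuracy: float, threshold: float = ACCURACY_THRESHOLD) -> bool:
    """Flag an extraction for manual verification when accuracy is below the threshold."""
    return accuracy < threshold


# Made-up example record: one of four fields is extracted incorrectly.
reference = {
    "product_name": "Acetone",
    "cas_number": "67-64-1",
    "signal_word": "Danger",
    "flash_point": "-20 °C",
}
predicted = {
    "product_name": "acetone",
    "cas_number": "67-64-1",
    "signal_word": "Warning",
    "flash_point": "-20  °C",
}

accuracy = field_accuracy(predicted, reference)
print(accuracy, needs_human_review(accuracy))  # 0.75 True
```

Under this rule, the best reported score of 84% would still be routed to a human reviewer, which matches the summary's conclusion that no model is ready for unsupervised use.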

Source Excerpt

arXiv:2606.11204v1. Abstract: Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art large language models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines. We systematically evaluate four models: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B, across three prompting strategies: zero-shot, few-shot, and chain-of-thought. …
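
As a rough illustration of how the three prompting strategies named in the abstract differ, the sketch below builds one prompt per strategy. The excerpt does not include the paper's actual prompts, so the instruction wording, the field list, and the `build_prompt` helper are assumptions made purely for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the paper's real prompts are not in the excerpt.
INSTRUCTION = (
    "Extract the product name, CAS number, and signal word from the "
    "Safety Data Sheet text below. Answer as JSON."
)
COT_HINT = (
    "First reason step by step about where each field appears, "
    "then give the final JSON."
)


def build_prompt(
    sds_text: str,
    strategy: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt for one of: zero-shot, few-shot, chain-of-thought."""
    parts = [INSTRUCTION]
    if strategy == "zero-shot":
        pass  # instruction only, no demonstrations
    elif strategy == "few-shot":
        if not examples:
            raise ValueError("few-shot prompting needs at least one example")
        # Each demonstration pairs an SDS snippet with its expected answer.
        for document, answer in examples:
            parts.append(f"SDS:\n{document}\nAnswer:\n{answer}")
    elif strategy == "chain-of-thought":
        parts.append(COT_HINT)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # The document to extract from always comes last.
    parts.append(f"SDS:\n{sds_text}\nAnswer:")
    return "\n\n".join(parts)


demo = [("Product: Ethanol ...", '{"product_name": "Ethanol"}')]
for name in ("zero-shot", "few-shot", "chain-of-thought"):
    prompt = build_prompt("Product: Acetone ...", name, demo)
    print(name, len(prompt))
```

The same instruction is reused across strategies so that any difference in extraction quality can be attributed to the added demonstrations or reasoning hint rather than to the task wording.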
