OpenRTLSet: A Fully Open-Source Dataset for Large Language… | AI Deep Signal

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

arXiv cs.CL·Jinghua Wang, Lily Jiaxin Wan, Sanjana Pingali, Scott Smith, Manvi Jha, Shalini Sivakumar, Xing Zhao, Kaiwen Cao, Deming Chen

6/10/2026

·~2 min·6/10/2026·en·1

Quick Answer

OpenRTLSet is the largest open-source dataset for hardware design, featuring over 131,000 Verilog code samples.

Quick Take

It enables fine-tuning of language models like Qwen and Granite for Verilog code generation, demonstrating superior performance in hardware design tasks through open-source methodologies.

Key Points

Dataset includes 102k Verilog modules from GitHub and 29k from translations.
Paired natural language descriptions generated using DeepSeek-R1 model.
Explores quantization techniques like INT4 and BF16 for performance.
Supports fine-tuning for various language model families.
Establishes a foundation for accessible research and commercial use.

Paper Resources

Read Paperarxiv.org View PDFarxiv.org

Source Excerpt

OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code s

Read the full article on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CL

See more →

arXiv cs.CL·Isabel Xu (The Overlake School), Cynthia Xu (The Overlake School), Rachel Ren (Edwards Vacuum Inc.), Cong Guo (The University of Memphis), Jiacheng Ding (The University of Memphis)

5d ago

FeaturedOriginal

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

AI Summary

TriAgent introduces a cost-efficient multi-agent system for financial sentiment analysis, combining VADER, FinBERT, and Qwen2.5. It achieves an F1 score of ~0.87 with significant savings of $9.3M/year at a 10M-user scale compared to GPT-4o-mini, while also detecting hallucinations with an AUC of 0.90.

#LLM #Agent #AI Startup #Enterprise AI

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

Quick Answer

Quick Take

Key Points

Paper Resources

Source Excerpt

Want this in your inbox every morning?

More from arXiv cs.CL

TriAgent: Divergence-Aware Multi-Agent Committees for Cost-Efficient Financial Sentiment Analysis

RF-Agent: A Practical Framework for Building Language Agents for RFIC Design

Letting the Data Speak: Extracting Keywords from Crowdsourced Collections with AI

TriAgent: Divergence-Aware Committees for Cost-Efficient Financial Sentiment Analysis