Show HN: Spec27 – Spec-driven validation for AI agents
Quick Answer
Spec27 is a new tool for validating AI agents, focusing on spec-driven testing to ensure reliability as models and systems evolve.
Quick Take
Spec27 is a new tool for validating AI agents, focusing on spec-driven testing to ensure reliability as models and systems evolve. It allows teams to define reusable specifications for agent behavior, generating tests that assess robustness and sensitivity to changes, currently in early access for language-model-based agents.
Key Points
- Focuses on external testing without needing access to internal agent stacks.
- Allows teams to create reusable specifications for agent behavior.
- Generates adversarial and robustness checks automatically.
- Currently strongest for single-turn agent validation, multi-turn support is planned.
- Open for early access, seeking feedback from users deploying internal and vendor agents.
Article Content
From source RSS / original summaryHi HN! We’re a team of ML validation specialists and we’ve been building /Spec27, a tool for testing whether AI agents still do their job safely and reliably as models, prompts, tools, and surrounding systems change. <p>We started working on this because a lot of current LLM evaluation work seems aimed at scoring general model behavior, while many teams are deploying systems that have a specific mission to fulfill.
Many of the tools also assume you have full access to the agent stack and traces so you can place SDKs and Gateways, but a lot of agents are being created on vendor platforms where this isn’t possible. <p>As a result, we approaches it from the outside in: all tests just run to the primary interfaces of an Agent and don’t assume anything about internals. The other important things about the approach is spec-driven.
Instead of treating testing as a one-off benchmark or static eval set, we let teams define reusable specifications for the behavior they want from an agent, then generate tests against those specs. With this you can automatically generate adversarial and robustness checks, so you can see what an agent is sensitive to and what kinds of changes cause it to fail.
<p>We’ve worked on validation for other AI systems before, including vision and tabular workflows, and /Spec27 is our new product for language-model-based agents. Currently in early access, so we’d love feedback! The current version is strongest for single-turn agent and application validation. We do not fully support multi-turn interactions yet, and better telemetry/tool-call integration is still on our roadmap.
<p>We’ve made the product open to try for HN readers, with a sample flow so it’s easy to poke around without much setup. We’d especially love feedback from people deploying internal agents, vendor agents, or other AI systems where reliability matters more than benchmark scores.
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from Hacker News
See more →Show HN: RLM-based local debugger for AI agent traces
HALO (Hierarchal Agent Loop Optimizer) is an open-source tool designed for debugging AI agents by analyzing OTEL compliant execution traces. It utilizes a Recursive Language Model (RLM) to efficiently identify patterns and systemic issues, enabling developers to optimize their agents iteratively without complex setups.


