ASSERT Framework Turns AI Specs Into Executable Evals
Summary
Microsoft has released ASSERT, an open-source framework that converts natural-language behavior requirements into executable evaluation pipelines for AI models, agents, and applications. The tool helps teams build behavior-specific tests faster, improve regression coverage, and better validate whether AI systems follow product policies and safety expectations.
Introduction
AI teams often document intended behaviors in policy notes, prompts, or product requirements, but turning those expectations into reliable evaluation suites is slow and difficult. Microsoft’s new open-source ASSERT framework aims to close that gap by converting plain-language specifications into runnable, inspectable evaluations for models, agents, and AI applications.
For security and governance teams, this matters because generic AI metrics like relevance or helpfulness do not always catch application-specific failures such as unsafe tool use, policy violations, or risky decision-making.
What is ASSERT?
ASSERT stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing. It is designed to make behavior specifications a direct input to AI evaluation, rather than relying on broad benchmarks that may miss real-world product requirements.
How the pipeline works
ASSERT uses four main stages:
- Systematization: Converts a broad behavior requirement into a structured concept specification.
- Taxonomization: Builds an editable taxonomy of permissible and impermissible behaviors.
- Test generation: Creates stratified single-turn or multi-turn test cases across declared conditions such as persona, task type, tool access, or environment.
- Scoring: Evaluates traces against the taxonomy and returns labels, rationales, policy citations, and failure patterns.
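To make the test-generation stage concrete, here is a minimal sketch of stratified case generation over declared conditions. It is not ASSERT's actual API: `TestCase`, `generate_stratified_cases`, and the example taxonomy entries are hypothetical names chosen for illustration, and the sketch assumes "stratified" means at least one case per combination of condition values.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TestCase:
    """Hypothetical test-case record; not ASSERT's real schema."""
    behavior: str       # taxonomy entry the case is meant to probe
    conditions: tuple   # ((condition_name, value), ...)
    turns: int          # 1 for single-turn, >1 for multi-turn


def generate_stratified_cases(behaviors, conditions, turns=1):
    """Emit one case per (behavior, condition combination) so every stratum is covered."""
    names = sorted(conditions)  # fixed order keeps output deterministic
    cases = []
    for behavior in behaviors:
        for combo in product(*(conditions[name] for name in names)):
            cases.append(TestCase(behavior, tuple(zip(names, combo)), turns))
    return cases


# Illustrative taxonomy and conditions (assumed, not taken from ASSERT).
taxonomy = {
    "permissible": ["explain general policy"],
    "impermissible": ["give individualized dosage advice"],
}
conditions = {
    "persona": ["patient", "clinician"],
    "tool_access": ["none", "search"],
}

behaviors = taxonomy["permissible"] + taxonomy["impermissible"]
cases = generate_stratified_cases(behaviors, conditions)
print(len(cases))  # 8: 2 behaviors x 2 personas x 2 tool-access settings
```

A full cross product grows quickly as conditions are added, so a real pipeline would likely sample within strata rather than enumerate every combination.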
A key capability is instrumentation. ASSERT can capture not only final outputs but also tool calls, retrieved context, routing decisions, and intermediate actions. This matters for agentic systems, where the final answer alone may not explain risky behavior.
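The value of capturing intermediate steps can be illustrated with a small trace record. This is a sketch under stated assumptions: `Trace`, `TraceEvent`, `find_disallowed_tool_calls`, and the event kinds are hypothetical and do not reflect ASSERT's real instrumentation format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceEvent:
    """One captured step. The 'kind' values used here are assumptions."""
    kind: str                 # e.g. "tool_call", "retrieval", "routing", "final_output"
    payload: dict[str, Any]


@dataclass
class Trace:
    """Ordered record of everything the system did for one test case."""
    case_id: str
    events: list[TraceEvent] = field(default_factory=list)

    def record(self, kind: str, **payload: Any) -> None:
        self.events.append(TraceEvent(kind, payload))

    def of_kind(self, kind: str) -> list[TraceEvent]:
        return [event for event in self.events if event.kind == kind]


def find_disallowed_tool_calls(trace: Trace, allowed_tools: set[str]) -> list[str]:
    """Return names of tools that were called but are not on the allow list."""
    return [
        event.payload["name"]
        for event in trace.of_kind("tool_call")
        if event.payload["name"] not in allowed_tools
    ]


trace = Trace("case-1")
trace.record("tool_call", name="search", args={"query": "refund policy"})
trace.record("tool_call", name="send_email", args={"to": "customer"})
trace.record("final_output", text="Here is a summary of the refund policy.")

print(find_disallowed_tool_calls(trace, {"search"}))  # ['send_email']
```

In this example the final output looks harmless, and only the recorded tool calls reveal that the system sent an email it was not permitted to send.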
Validation results
Microsoft says internal validation showed stronger behavior-specific coverage than a baseline that generated test cases directly from the same written intent. According to the study, ASSERT:
- Covered about 1.2x more of the intended behavior space
- Surfaced roughly 1.5x more inspectable cases
- Produced 4x greater separation between stronger and weaker systems
- Had about half as many saturated cases where all models behaved the same
- Found about 2x more distinct failure patterns
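As an illustration of the "saturated cases" metric above, the following sketch counts cases on which every evaluated system received the same label. The function name, data layout, and labels are assumptions made for this example; Microsoft's exact definition may differ.

```python
def saturation_rate(results):
    """Fraction of cases where all systems got the same label.

    results maps case_id -> {system_name: label}. This layout is hypothetical.
    """
    if not results:
        raise ValueError("results must contain at least one case")
    saturated = sum(
        1 for labels in results.values() if len(set(labels.values())) == 1
    )
    return saturated / len(results)


results = {
    "case-1": {"model_a": "pass", "model_b": "pass"},  # saturated
    "case-2": {"model_a": "pass", "model_b": "fail"},
    "case-3": {"model_a": "fail", "model_b": "fail"},  # saturated
    "case-4": {"model_a": "fail", "model_b": "pass"},
}

print(saturation_rate(results))  # 0.5
```

Saturated cases cannot distinguish one system from another, which is why a lower rate suggests a more discriminating test suite.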
On judge quality, agreement between the LLM judge and human annotators was typically in the 80–90% range, compared with human inter-annotator agreement of around 90%.
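The source does not say which agreement statistic was used. Assuming simple percent agreement (the share of items on which two raters assign the same label), the calculation can be sketched as follows; the labels are made-up illustration data, not results from the study.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items on which two raters assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up labels for ten traces; the raters disagree on items 4 and 10.
llm_judge = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

print(percent_agreement(llm_judge, human))  # 0.8
```

Raw percent agreement does not correct for agreement expected by chance, so a chance-corrected statistic such as Cohen's kappa can give a different picture when one label dominates.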
Why this matters for IT and security teams
Organizations deploying copilots, assistants, and agent workflows need repeatable ways to test policy compliance and behavioral boundaries. ASSERT could help teams validate scenarios such as:
- Unsafe health or financial guidance
- Tool-use governance violations
- Task adherence failures
- Restricted data handling issues
- Policy drift during model or workflow updates
Next steps
Teams building internal AI apps or security-sensitive agents should review ASSERT as a possible framework for regression testing and policy validation. Because it is open source, organizations can adapt the taxonomy and test generation process to their own governance, compliance, and operational requirements.