ASSERT Framework Turns AI Specs Into Executable Evals

June 10, 2026 · 3 min read

Summary

Microsoft has released ASSERT, an open-source framework that converts natural-language behavior requirements into executable evaluation pipelines for AI models, agents, and applications. The tool helps teams build behavior-specific tests faster, improve regression coverage, and better validate whether AI systems follow product policies and safety expectations.

Introduction

AI teams often document intended behaviors in policy notes, prompts, or product requirements, but turning those expectations into reliable evaluation suites is slow and difficult. Microsoft’s new open-source ASSERT framework aims to close that gap by converting plain-language specifications into runnable, inspectable evaluations for models, agents, and AI applications.

For security and governance teams, this matters because generic AI metrics like relevance or helpfulness do not always catch application-specific failures such as unsafe tool use, policy violations, or risky decision-making.

What is ASSERT?

ASSERT stands for Adaptive Spec-driven Scoring for Evaluation and Regression Testing. It is designed to make behavior specifications a direct input to AI evaluation, rather than relying on broad benchmarks that may miss real-world product requirements.

How the pipeline works

ASSERT uses four main stages:

Systematization: Converts a broad behavior requirement into a structured concept specification.
Taxonomization: Builds an editable taxonomy of permissible and impermissible behaviors.
Test generation: Creates stratified single-turn or multi-turn test cases across declared conditions such as persona, task type, tool access, or environment.
Scoring: Evaluates traces against the taxonomy and returns labels, rationales, policy citations, and failure patterns.
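The four stages can be pictured with a small sketch. This is not ASSERT's API: every name here (Taxonomy, EvalCase, generate_cases, score_trace) is hypothetical, and the scorer is a simple label lookup standing in for whatever judge the framework actually uses. It only illustrates how a taxonomy and declared conditions could drive stratified test generation and scoring.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Taxonomy:
    """Hypothetical editable taxonomy for one behavior concept."""
    concept: str
    permissible: list[str] = field(default_factory=list)
    impermissible: list[str] = field(default_factory=list)


@dataclass
class EvalCase:
    """Hypothetical test case tagged with the conditions it covers."""
    case_id: str
    conditions: dict[str, str]
    prompt: str


def generate_cases(concept: str, conditions: dict[str, list[str]]) -> list[EvalCase]:
    """Stratify by emitting one case per combination of declared condition values."""
    names = sorted(conditions)
    cases = []
    for i, combo in enumerate(product(*(conditions[n] for n in names))):
        cond = dict(zip(names, combo))
        prompt = f"[{concept}] " + ", ".join(f"{k}={v}" for k, v in cond.items())
        cases.append(EvalCase(f"case-{i:03d}", cond, prompt))
    return cases


def score_trace(observed_behaviors: list[str], taxonomy: Taxonomy) -> dict:
    """Label a trace as failing if any observed behavior is listed as impermissible."""
    violations = [b for b in observed_behaviors if b in taxonomy.impermissible]
    return {
        "label": "fail" if violations else "pass",
        "violations": violations,
        "rationale": f"{len(violations)} impermissible behavior(s) for '{taxonomy.concept}'",
    }


# Illustrative values only; none of these come from the ASSERT project.
taxonomy = Taxonomy(
    concept="refund handling",
    permissible=["explain_refund_policy", "escalate_to_human"],
    impermissible=["issue_refund_without_approval"],
)
cases = generate_cases(
    taxonomy.concept,
    {"persona": ["novice", "expert"], "tool_access": ["none", "payments_api"]},
)
result = score_trace(["issue_refund_without_approval"], taxonomy)
```

Taking the full cross product of condition values is the simplest way to guarantee every stratum is covered; a real generator would likely sample within strata once the number of conditions grows.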

A key capability is instrumentation. ASSERT can capture not only final outputs but also tool calls, retrieved context, routing decisions, and intermediate actions. This matters for agentic systems, where the final answer alone may not explain risky behavior.
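To see why intermediate actions matter, consider a minimal sketch of a trace record. The Trace and ToolCall types and the disallowed_tool_calls helper are assumptions made for illustration, not ASSERT's data model. The point is that a check over tool calls can flag a run whose final answer looks harmless.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation made by an agent."""
    name: str
    arguments: dict


@dataclass
class Trace:
    """Hypothetical trace: the final output plus what happened along the way."""
    final_output: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    retrieved_context: list[str] = field(default_factory=list)


def disallowed_tool_calls(trace: Trace, allowed_tools: set[str]) -> list[ToolCall]:
    """Return every tool call whose tool is not on the allow-list."""
    return [call for call in trace.tool_calls if call.name not in allowed_tools]


# The final answer reads as benign, but the agent called a tool it should not have.
trace = Trace(
    final_output="Your account summary is ready.",
    tool_calls=[
        ToolCall("read_account", {"account_id": "A-1"}),
        ToolCall("delete_records", {"account_id": "A-1"}),
    ],
)
flagged = disallowed_tool_calls(trace, allowed_tools={"read_account"})
```

An evaluation that scored only final_output would pass this run; scoring the full trace surfaces the delete_records call.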

Validation results

Microsoft says internal validation showed stronger behavior-specific coverage than a baseline that generates test cases directly from the same written intent. According to the study, ASSERT:

Covered about 1.2x more of the intended behavior space
Surfaced roughly 1.5x more inspectable cases
Separated stronger and weaker systems 4x more clearly
Had about half as many saturated cases where all models behaved the same
Found about 2x more distinct failure patterns
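The article does not say how these quantities were computed. As one plausible reading, a saturated case is a case on which every model receives the same label, and the gap between per-model pass rates is a rough proxy for separation. The sketch below works under those assumptions; the function names and the sample data are invented for illustration.

```python
def saturation_rate(results: dict[str, dict[str, str]]) -> float:
    """Share of cases where all models got the same label.

    results maps case_id -> {model_name: label}.
    """
    if not results:
        return 0.0
    saturated = sum(
        1 for labels in results.values() if len(set(labels.values())) == 1
    )
    return saturated / len(results)


def pass_rates(results: dict[str, dict[str, str]]) -> dict[str, float]:
    """Per-model fraction of cases labeled 'pass'."""
    totals: dict[str, int] = {}
    passes: dict[str, int] = {}
    for labels in results.values():
        for model, label in labels.items():
            totals[model] = totals.get(model, 0) + 1
            passes[model] = passes.get(model, 0) + (label == "pass")
    return {model: passes[model] / totals[model] for model in totals}


# Invented sample: two models scored on four cases.
results = {
    "c1": {"model_a": "pass", "model_b": "pass"},
    "c2": {"model_a": "pass", "model_b": "fail"},
    "c3": {"model_a": "fail", "model_b": "fail"},
    "c4": {"model_a": "pass", "model_b": "fail"},
}
rate = saturation_rate(results)
rates = pass_rates(results)
gap = rates["model_a"] - rates["model_b"]
```

Saturated cases carry no information for comparing systems, so a lower saturation rate means more of the suite contributes to telling models apart.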

For judge quality, LLM-to-human agreement was typically in the 80–90% range, with human inter-annotator agreement around 90%.
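The article does not state which agreement statistic was used. Assuming raw percent agreement, the figure is the share of items on which the LLM judge and a human annotator assign the same label. The sketch below computes that, plus Cohen's kappa as a chance-corrected alternative; the label lists are invented for illustration.

```python
from collections import Counter


def percent_agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which two raters assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = percent_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (count_a[label] / n) * (count_b[label] / n)
        for label in set(a) | set(b)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for five traces.
judge_labels = ["pass", "pass", "fail", "fail", "pass"]
human_labels = ["pass", "fail", "fail", "fail", "pass"]
agreement = percent_agreement(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

Human inter-annotator agreement acts as a practical ceiling here: a judge that agrees with humans about as often as humans agree with each other is hard to improve on by this measure.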

Why this matters for IT and security teams

Organizations deploying copilots, assistants, and agent workflows need repeatable ways to test policy compliance and behavioral boundaries. ASSERT could help teams validate scenarios such as:

Unsafe health or financial guidance
Tool-use governance violations
Task adherence failures
Restricted data handling issues
Policy drift during model or workflow updates

Next steps

Teams building internal AI apps or security-sensitive agents should review ASSERT as a possible framework for regression testing and policy validation. Because it is open source, organizations can adapt the taxonomy and test generation process to their own governance, compliance, and operational requirements.
