Microsoft CTI-REALM Benchmarks AI Detection Engineering

March 20, 2026 · 3 min read

Summary

Microsoft has introduced CTI-REALM, an open-source benchmark designed to test whether AI agents can actually perform detection engineering tasks end to end, from interpreting threat intelligence reports to generating and refining KQL and Sigma detection rules. This matters because it gives security teams a more realistic way to evaluate AI for SOC operations, focusing on measurable operational outcomes across real environments instead of simple cybersecurity question answering.

Introduction

Microsoft has announced CTI-REALM, a new open-source benchmark aimed at a growing challenge in security operations: determining whether AI agents can do real detection engineering work, not just answer cybersecurity questions. For security teams evaluating AI for SOC and detection use cases, the benchmark is relevant because it measures operational outcomes, namely building and validating detections from threat intelligence.

What’s new with CTI-REALM

CTI-REALM (Cyber Threat Intelligence Real World Evaluation and LLM Benchmarking) is built to test the full workflow security analysts follow when creating detections.

Key capabilities

- Evaluates AI agents on end-to-end detection rule generation rather than isolated CTI knowledge tests.
- Uses 37 curated CTI reports from public sources including Microsoft Security, Datadog Security Labs, Palo Alto Networks, and Splunk.
- Measures performance across Linux endpoints, Azure Kubernetes Service (AKS), and Azure cloud infrastructure.
- Scores not only final outputs but also intermediate steps, such as:
  - CTI report understanding
  - MITRE ATT&CK technique mapping
  - Data source identification
  - KQL query refinement
  - Sigma rule generation
- Provides agents with realistic tooling, including CTI repositories, schema explorers, Kusto query engines, MITRE ATT&CK references, and Sigma databases.
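
The idea of scoring intermediate steps alongside the final rule can be illustrated with a small sketch. This is not CTI-REALM's actual scoring code. The checkpoint names come from the list above, while the `CheckpointScores` class, the weights, and the 0-to-1 scale are assumptions made purely for illustration.

```python
from dataclasses import dataclass, asdict

# Hypothetical weights (NOT taken from CTI-REALM); they sum to 1.0.
WEIGHTS = {
    "cti_understanding": 0.15,
    "attack_mapping": 0.15,
    "data_source_identification": 0.20,
    "kql_refinement": 0.25,
    "sigma_generation": 0.25,
}


@dataclass
class CheckpointScores:
    """Per-step scores for one task, each assumed to lie in [0, 1]."""
    cti_understanding: float
    attack_mapping: float
    data_source_identification: float
    kql_refinement: float
    sigma_generation: float


def weighted_total(scores: CheckpointScores) -> float:
    """Combine per-step scores into one number using the assumed weights."""
    values = asdict(scores)
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * values[name] for name in WEIGHTS)


# Illustrative input: strong report comprehension, no usable Sigma rule.
example = CheckpointScores(1.0, 1.0, 0.5, 0.5, 0.0)
print(round(weighted_total(example), 3))
```

Keeping the per-step values next to the total shows that an agent can read a report well and still fail to produce a working rule, which a single final-output score would hide.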

Early findings from Microsoft’s testing

Microsoft evaluated 16 frontier model configurations on CTI-REALM-50, a 50-task benchmark set.

Notable results include:

- Anthropic Claude models led the rankings, largely due to stronger tool use and iterative query refinement.
- In the GPT-5 family, medium reasoning outperformed high reasoning, suggesting that more reasoning can reduce effectiveness in agentic detection scenarios.
- Azure cloud detection proved the most difficult, with lower scores than Linux and AKS due to the complexity of correlating multiple telemetry sources.
- Removing CTI-specific tools reduced performance across all tested models.
- Adding human-authored workflow guidance significantly improved smaller model performance.
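
An ablation such as the tool-removal finding can be read as a per-model score difference between two runs. The sketch below assumes scores on a 0-to-1 scale. The function name, model labels, and numbers are placeholders invented for illustration and are not results from Microsoft's evaluation.

```python
def ablation_deltas(baseline: dict[str, float], ablated: dict[str, float]) -> dict[str, float]:
    """Return the score change (ablated minus baseline) for models present in both runs."""
    shared = baseline.keys() & ablated.keys()
    return {model: ablated[model] - baseline[model] for model in sorted(shared)}


# Placeholder values only; NOT real benchmark results.
with_tools = {"model_a": 0.60, "model_b": 0.45}
without_tools = {"model_a": 0.50, "model_b": 0.40}

deltas = ablation_deltas(with_tools, without_tools)
all_dropped = all(change < 0 for change in deltas.values())

for model, change in deltas.items():
    print(f"{model}: {change:+.2f}")
print("every model dropped:", all_dropped)
```

A negative delta for every model corresponds to the statement that removing the tools reduced performance across the board.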

Why this matters for IT and security administrators

For SOC leaders, detection engineers, and security architects, CTI-REALM offers a more practical way to evaluate AI before using it in production workflows. Instead of relying on broad benchmark scores, teams can identify the specific stage where a model struggles, such as threat comprehension, telemetry mapping, or rule specificity.

This can help organizations:

- Validate AI model suitability for detection engineering tasks
- Identify where human review and guardrails are still required
- Compare models objectively before operational deployment
- Improve confidence in AI-assisted detection development
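
As a rough sketch of how per-stage results could feed a review policy, the snippet below flags stages whose score falls under a chosen threshold. The function name, the 0.7 threshold, and the example scores are hypothetical. The stage names follow the examples in this section, and none of the values come from the benchmark.

```python
def stages_needing_review(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return stage names scoring below the threshold, weakest first."""
    weak = [name for name, score in scores.items() if score < threshold]
    return sorted(weak, key=lambda name: scores[name])


# Placeholder per-stage scores for one candidate model (illustrative only).
candidate = {
    "threat_comprehension": 0.90,
    "telemetry_mapping": 0.55,
    "rule_specificity": 0.40,
}

flagged = stages_needing_review(candidate)
print(flagged)
```

A team could route the flagged stages to mandatory human review while letting the stronger stages run with lighter oversight. The threshold is a policy choice, not a value the benchmark prescribes.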

Next steps

Security teams interested in AI-assisted detection engineering should:

- Review the CTI-REALM research paper and benchmark methodology
- Test candidate models against the benchmark before production adoption
- Use results to define review processes and guardrails
- Monitor the Inspect AI repository for CTI-REALM availability and community contributions

Microsoft is positioning CTI-REALM as a community resource to help the industry benchmark models consistently and adopt AI more safely in security operations.
