Microsoft CTI-REALM Benchmarks AI Detection Engineering

Summary

Microsoft has introduced CTI-REALM, an open-source benchmark designed to test whether AI agents can actually perform detection engineering tasks end to end, from interpreting threat intelligence reports to generating and refining KQL and Sigma detection rules. This matters because it gives security teams a more realistic way to evaluate AI for SOC operations, focusing on measurable operational outcomes across real environments instead of simple cybersecurity question answering.

Introduction

Microsoft has announced CTI-REALM, a new open-source benchmark aimed at a growing challenge in security operations: determining whether AI agents can do real detection engineering work, not just answer cybersecurity questions. For security teams evaluating AI for SOC and detection use cases, this matters because the benchmark measures operational outcomes, namely building and validating detections from threat intelligence.

What’s new with CTI-REALM

CTI-REALM (Cyber Threat Intelligence Real World Evaluation and LLM Benchmarking) is built to test the full workflow security analysts follow when creating detections.

Key capabilities

  • Evaluates AI agents on end-to-end detection rule generation rather than isolated CTI knowledge tests.
  • Uses 37 curated CTI reports from public sources including Microsoft Security, Datadog Security Labs, Palo Alto Networks, and Splunk.
  • Measures performance across Linux endpoints, Azure Kubernetes Service (AKS), and Azure cloud infrastructure.
  • Scores not only final outputs, but also intermediate steps such as:
    • CTI report understanding
    • MITRE ATT&CK technique mapping
    • Data source identification
    • KQL query refinement
    • Sigma rule generation
  • Provides agents with realistic tooling, including CTI repositories, schema explorers, Kusto query engines, MITRE ATT&CK references, and Sigma databases.
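
Scoring intermediate steps as well as the final output can be pictured as a weighted sum over checkpoints. The following is a minimal sketch under stated assumptions, not CTI-REALM's actual scoring: the checkpoint names simply mirror the steps listed above, and the weights and the `task_score` helper are hypothetical.

```python
# Hypothetical checkpoint names mirroring the intermediate steps listed above.
# The weights are illustrative assumptions, not CTI-REALM's real scoring scheme.
WEIGHTS = {
    "cti_understanding": 0.15,
    "attack_mapping": 0.15,
    "data_source_identification": 0.15,
    "kql_refinement": 0.25,
    "sigma_generation": 0.30,
}


def task_score(checkpoints: dict[str, float]) -> float:
    """Combine per-checkpoint scores (each in [0, 1]) into one task score.

    Checkpoints the agent never reached are treated as 0.
    """
    for name, value in checkpoints.items():
        if name not in WEIGHTS:
            raise KeyError(f"unknown checkpoint: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name} must be in [0, 1], got {value}")
    return sum(weight * checkpoints.get(name, 0.0) for name, weight in WEIGHTS.items())


# An agent that understood the report and mapped techniques correctly,
# but produced only a partially correct query and no usable Sigma rule.
partial = {
    "cti_understanding": 1.0,
    "attack_mapping": 1.0,
    "data_source_identification": 0.5,
    "kql_refinement": 0.4,
}
print(round(task_score(partial), 3))
```

Under a scheme like this, an agent earns partial credit for correct early steps, which makes it possible to see where in the workflow it fell short instead of only recording a failed rule.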

Early findings from Microsoft’s testing

Microsoft evaluated 16 frontier model configurations on CTI-REALM-50, a 50-task benchmark set.

Notable results include:

  • Anthropic Claude models led the rankings, largely due to stronger tool use and iterative query refinement.
  • In the GPT-5 family, medium reasoning outperformed high reasoning, suggesting that additional reasoning effort does not necessarily help and can reduce effectiveness in agentic detection scenarios.
  • Azure cloud detection proved the most difficult, with lower scores than Linux and AKS due to the complexity of correlating multiple telemetry sources.
  • Removing CTI-specific tools reduced performance across all tested models.
  • Adding human-authored workflow guidance significantly improved smaller model performance.

Why this matters for IT and security administrators

For SOC leaders, detection engineers, and security architects, CTI-REALM offers a more practical way to evaluate AI before using it in production workflows. Instead of relying on broad benchmark scores, teams can identify the specific stage where a model struggles, such as threat comprehension, telemetry mapping, or rule specificity.

This can help organizations:

  • Validate AI model suitability for detection engineering tasks
  • Identify where human review and guardrails are still required
  • Compare models objectively before operational deployment
  • Improve confidence in AI-assisted detection development
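
One way to turn per-stage results into a review policy is to flag every stage that falls below a chosen threshold. This is a rough sketch only: the stage names come from the examples above, while the `stages_needing_review` helper, the 0.6 threshold, and the scores are placeholders for illustration, not CTI-REALM results.

```python
def stages_needing_review(stage_scores: dict[str, float], threshold: float = 0.6) -> list[str]:
    """Return the stages scoring below the threshold, weakest first."""
    weak = [(score, stage) for stage, score in stage_scores.items() if score < threshold]
    return [stage for _, stage in sorted(weak)]


# Placeholder numbers for illustration only; not results from CTI-REALM.
example = {
    "threat_comprehension": 0.82,
    "telemetry_mapping": 0.41,
    "rule_specificity": 0.55,
}
print(stages_needing_review(example))  # ['telemetry_mapping', 'rule_specificity']
```

The flagged stages are natural candidates for mandatory human review, while stages above the threshold might only need spot checks.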

Next steps

Security teams interested in AI-assisted detection engineering should:

  • Review the CTI-REALM research paper and benchmark methodology
  • Test candidate models against the benchmark before production adoption
  • Use results to define review processes and guardrails
  • Monitor the Inspect AI repository for CTI-REALM availability and community contributions

Microsoft is positioning CTI-REALM as a community resource to help the industry benchmark models consistently and adopt AI more safely in security operations.
