
Microsoft CTI-REALM Benchmarks AI Detection Engineering


Summary

Microsoft has introduced CTI-REALM, an open-source benchmark designed to test whether AI agents can actually perform detection engineering tasks end to end, from interpreting threat intelligence reports to generating and refining KQL and Sigma detection rules. This matters because it gives security teams a more realistic way to evaluate AI for SOC operations, focusing on measurable operational outcomes across real environments instead of simple cybersecurity question answering.


Introduction

Microsoft has announced CTI-REALM, a new open-source benchmark aimed at a growing challenge in security operations: determining whether AI agents can do real detection engineering work, not just answer cybersecurity questions. For security teams evaluating AI for SOC and detection use cases, this matters because the benchmark focuses on operational outcomes—building and validating detections from threat intelligence.

What’s new with CTI-REALM

CTI-REALM (Cyber Threat Intelligence Real World Evaluation and LLM Benchmarking) is built to test the full workflow security analysts follow when creating detections.

Key capabilities

  • Evaluates AI agents on end-to-end detection rule generation rather than isolated CTI knowledge tests.
  • Uses 37 curated CTI reports from public sources including Microsoft Security, Datadog Security Labs, Palo Alto Networks, and Splunk.
  • Measures performance across Linux endpoints, Azure Kubernetes Service (AKS), and Azure cloud infrastructure.
  • Scores not only final outputs, but also intermediate steps such as:
    • CTI report understanding
    • MITRE ATT&CK technique mapping
    • Data source identification
    • KQL query refinement
    • Sigma rule generation
  • Provides agents with realistic tooling, including CTI repositories, schema explorers, Kusto query engines, MITRE ATT&CK references, and Sigma databases.
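To make the task concrete: the artifacts agents must produce are real hunting queries, not prose. The following is a purely illustrative sketch, not drawn from the benchmark itself, of the kind of KQL detection an agent might generate for the Linux endpoint scenario; the `DeviceProcessEvents` table and column names assume the Microsoft Defender advanced hunting schema, and the command-line patterns are hypothetical:

```kql
// Illustrative only: flag shell sessions that download a remote script
// and pipe it straight into an interpreter on Linux endpoints.
// Assumes the Microsoft Defender advanced hunting schema.
DeviceProcessEvents
| where Timestamp > ago(1d)
| where InitiatingProcessFileName in~ ("bash", "sh", "zsh")
| where ProcessCommandLine has_any ("curl", "wget")
| where ProcessCommandLine contains "| bash" or ProcessCommandLine contains "| sh"
| project Timestamp, DeviceName, AccountName, ProcessCommandLine
```

CTI-REALM scores whether an agent can reach a query of this shape from a CTI report alone, and whether it can iteratively tighten the filters when a first draft proves too noisy.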

Early findings from Microsoft’s testing

Microsoft evaluated 16 frontier model configurations on CTI-REALM-50, a 50-task benchmark set.

Notable results include:

  • Anthropic Claude models led the rankings, largely due to stronger tool use and iterative query refinement.
  • In the GPT-5 family, medium reasoning outperformed high reasoning, suggesting that more reasoning can reduce effectiveness in agentic detection scenarios.
  • Azure cloud detection proved the most difficult, with lower scores than Linux and AKS due to the complexity of correlating multiple telemetry sources.
  • Removing CTI-specific tools reduced performance across all tested models.
  • Adding human-authored workflow guidance significantly improved smaller model performance.

Why this matters for IT and security administrators

For SOC leaders, detection engineers, and security architects, CTI-REALM offers a more practical way to evaluate AI before using it in production workflows. Instead of relying on broad benchmark scores, teams can identify where a model struggles—such as threat comprehension, telemetry mapping, or rule specificity.

This can help organizations:

  • Validate AI model suitability for detection engineering tasks
  • Identify where human review and guardrails are still required
  • Compare models objectively before operational deployment
  • Improve confidence in AI-assisted detection development

Next steps

Security teams interested in AI-assisted detection engineering should:

  • Review the CTI-REALM research paper and benchmark methodology
  • Test candidate models against the benchmark before production adoption
  • Use results to define review processes and guardrails
  • Monitor the Inspect AI repository for CTI-REALM availability and community contributions

Microsoft is positioning CTI-REALM as a community resource to help the industry benchmark models consistently and adopt AI more safely in security operations.

