Microsoft CTI-REALM Benchmarks AI Detection Engineering

Summary

Microsoft has introduced CTI-REALM, an open-source benchmark designed to test whether AI agents can actually perform detection engineering tasks end to end, from interpreting threat intelligence reports to generating and refining KQL and Sigma detection rules. This matters because it gives security teams a more realistic way to evaluate AI for SOC operations, focusing on measurable operational outcomes across real environments instead of simple cybersecurity question answering.

Introduction

Microsoft has announced CTI-REALM, a new open-source benchmark aimed at a growing challenge in security operations: determining whether AI agents can do real detection engineering work, not just answer cybersecurity questions. For security teams evaluating AI for SOC and detection use cases, this matters because the benchmark focuses on operational outcomes—building and validating detections from threat intelligence.

What’s new with CTI-REALM

CTI-REALM (Cyber Threat Intelligence Real World Evaluation and LLM Benchmarking) is built to test the full workflow security analysts follow when creating detections.

Key capabilities

  • Evaluates AI agents on end-to-end detection rule generation rather than isolated CTI knowledge tests.
  • Uses 37 curated CTI reports from public sources including Microsoft Security, Datadog Security Labs, Palo Alto Networks, and Splunk.
  • Measures performance across Linux endpoints, Azure Kubernetes Service (AKS), and Azure cloud infrastructure.
  • Scores not only final outputs, but also intermediate steps such as:
    • CTI report understanding
    • MITRE ATT&CK technique mapping
    • Data source identification
    • KQL query refinement
    • Sigma rule generation
  • Provides agents with realistic tooling, including CTI repositories, schema explorers, Kusto query engines, MITRE ATT&CK references, and Sigma databases.
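Because CTI-REALM scores intermediate steps rather than only the final rule, each step needs its own metric. As an illustration of how a step like MITRE ATT&CK technique mapping could be graded against ground truth, here is a minimal Python sketch; the function name, metric choice (set-based precision/recall/F1), and technique IDs are assumptions for this example, not Microsoft's actual scoring code.

```python
# Hypothetical sketch of scoring one intermediate step (ATT&CK technique
# mapping) against ground truth. CTI-REALM's real metrics may differ.

def technique_mapping_score(predicted: set[str], expected: set[str]) -> dict[str, float]:
    """Score an agent's ATT&CK technique mapping with precision, recall, and F1."""
    if not predicted and not expected:
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: the agent mapped three techniques; the report's ground truth has two.
agent_output = {"T1059.001", "T1566", "T1047"}
ground_truth = {"T1059.001", "T1566"}
print(technique_mapping_score(agent_output, ground_truth))
```

Here the agent over-predicted one technique, so recall is perfect but precision drops, which is exactly the kind of per-step signal a final-output-only benchmark would hide.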

Early findings from Microsoft’s testing

Microsoft evaluated 16 frontier model configurations on CTI-REALM-50, a 50-task benchmark set.

Notable results include:

  • Anthropic Claude models led the rankings, largely due to stronger tool use and iterative query refinement.
  • In the GPT-5 family, medium reasoning outperformed high reasoning, suggesting that more reasoning can reduce effectiveness in agentic detection scenarios.
  • Azure cloud detection proved the most difficult, with lower scores than Linux and AKS due to the complexity of correlating multiple telemetry sources.
  • Removing CTI-specific tools reduced performance across all tested models.
  • Adding human-authored workflow guidance significantly improved smaller model performance.

Why this matters for IT and security administrators

For SOC leaders, detection engineers, and security architects, CTI-REALM offers a more practical way to evaluate AI before using it in production workflows. Instead of relying on broad benchmark scores, teams can identify where a model struggles—such as threat comprehension, telemetry mapping, or rule specificity.

This can help organizations:

  • Validate AI model suitability for detection engineering tasks
  • Identify where human review and guardrails are still required
  • Compare models objectively before operational deployment
  • Improve confidence in AI-assisted detection development
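One practical way to turn per-step benchmark results into guardrail decisions is to flag any step where a candidate model scores below a review threshold. The sketch below assumes a simple dictionary of step names and scores (the step names, threshold, and score values are illustrative, not CTI-REALM's actual output format).

```python
# Hypothetical sketch: flag benchmark steps where a model scored below a
# chosen threshold, so human review and guardrails can target those steps.

REVIEW_THRESHOLD = 0.7  # assumed cutoff; tune to your organization's risk tolerance

def steps_needing_review(step_scores: dict[str, float],
                         threshold: float = REVIEW_THRESHOLD) -> list[str]:
    """Return the steps where the model scored below the review threshold."""
    return sorted(step for step, score in step_scores.items() if score < threshold)

# Illustrative per-step scores for one candidate model.
model_scores = {
    "cti_report_understanding": 0.85,
    "attack_technique_mapping": 0.78,
    "data_source_identification": 0.64,
    "kql_query_refinement": 0.55,
    "sigma_rule_generation": 0.72,
}
print(steps_needing_review(model_scores))
```

With these sample numbers, data source identification and KQL refinement fall below the threshold, so those are the stages where a detection engineer would keep a mandatory review step.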

Next steps

Security teams interested in AI-assisted detection engineering should:

  • Review the CTI-REALM research paper and benchmark methodology
  • Test candidate models against the benchmark before production adoption
  • Use results to define review processes and guardrails
  • Monitor the Inspect AI repository for CTI-REALM availability and community contributions

Microsoft is positioning CTI-REALM as a community resource to help the industry benchmark models consistently and adopt AI more safely in security operations.

