Microsoft CTI-REALM Benchmarks AI Detection Engineering
Summary
Microsoft has introduced CTI-REALM, an open-source benchmark designed to test whether AI agents can actually perform detection engineering tasks end to end, from interpreting threat intelligence reports to generating and refining KQL and Sigma detection rules. This matters because it gives security teams a more realistic way to evaluate AI for SOC operations, focusing on measurable operational outcomes in realistic environments rather than on simple cybersecurity question answering.
Introduction
Microsoft has announced CTI-REALM, a new open-source benchmark aimed at a growing challenge in security operations: determining whether AI agents can do real detection engineering work, not just answer cybersecurity questions. For security teams evaluating AI for SOC and detection use cases, this matters because the benchmark focuses on operational outcomes—building and validating detections from threat intelligence.
What’s new with CTI-REALM
CTI-REALM (Cyber Threat Intelligence Real World Evaluation and LLM Benchmarking) is built to test the full workflow security analysts follow when creating detections.
Key capabilities
- Evaluates AI agents on end-to-end detection rule generation rather than isolated CTI knowledge tests.
- Uses 37 curated CTI reports from public sources including Microsoft Security, Datadog Security Labs, Palo Alto Networks, and Splunk.
- Measures performance across Linux endpoints, Azure Kubernetes Service (AKS), and Azure cloud infrastructure.
- Scores not only final outputs but also intermediate steps, such as:
  - CTI report understanding
  - MITRE ATT&CK technique mapping
  - Data source identification
  - KQL query refinement
  - Sigma rule generation
- Provides agents with realistic tooling, including CTI repositories, schema explorers, Kusto query engines, MITRE ATT&CK references, and Sigma databases.
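To make the KQL-related steps concrete, the sketch below shows the kind of refined detection query an agent would be scored on for a Linux endpoint scenario. The query is purely illustrative and is not taken from the benchmark; the table and column names follow the Microsoft Defender advanced hunting schema (`DeviceProcessEvents`), but the detection logic is a hypothetical example.

```kql
// Hypothetical sketch, not a benchmark rule: a refined detection for
// download-and-execute activity on Linux endpoints. Table and column
// names follow the Microsoft Defender advanced hunting schema.
DeviceProcessEvents
| where Timestamp > ago(1d)
| where ProcessCommandLine has_any ("curl", "wget")
| where ProcessCommandLine has_any ("| sh", "| bash")
| project Timestamp, DeviceName, AccountName, ProcessCommandLine
```

An agent working against the benchmark's Kusto query engine would typically iterate on a query like this, tightening filters and adding exclusions to reduce false positives, which is the refinement behavior the intermediate scoring is designed to capture.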
Early findings from Microsoft’s testing
Microsoft evaluated 16 frontier model configurations on CTI-REALM-50, a 50-task benchmark set.
Notable results include:
- Anthropic Claude models led the rankings, largely due to stronger tool use and iterative query refinement.
- In the GPT-5 family, medium reasoning outperformed high reasoning, suggesting that more reasoning can reduce effectiveness in agentic detection scenarios.
- Azure cloud detection proved the most difficult, with lower scores than Linux and AKS due to the complexity of correlating multiple telemetry sources.
- Removing CTI-specific tools reduced performance across all tested models.
- Adding human-authored workflow guidance significantly improved smaller model performance.
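The Azure cloud tasks are harder precisely because they require correlating events across multiple telemetry tables rather than filtering a single log source. A minimal sketch of such a correlation in KQL follows; the table and column names match common Microsoft Sentinel schemas (`SigninLogs`, `AzureActivity`), but the query itself is a hypothetical illustration, not a rule from the benchmark.

```kql
// Hypothetical sketch, not a benchmark rule: correlate a high-risk
// sign-in with a role assignment made shortly afterward by the same
// principal. SigninLogs and AzureActivity are standard Sentinel tables.
SigninLogs
| where RiskLevelDuringSignIn == "high"
| project SigninTime = TimeGenerated, UserPrincipalName, IPAddress
| join kind=inner (
    AzureActivity
    | where OperationNameValue == "MICROSOFT.AUTHORIZATION/ROLEASSIGNMENTS/WRITE"
    | project ActivityTime = TimeGenerated, Caller, CallerIpAddress
) on $left.UserPrincipalName == $right.Caller
| where ActivityTime between (SigninTime .. SigninTime + 1h)
```

Getting the join keys, time windows, and operation names right across sources is exactly the kind of multi-step work where the benchmark reports lower model scores.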
Why this matters for IT and security administrators
For SOC leaders, detection engineers, and security architects, CTI-REALM offers a more practical way to evaluate AI before using it in production workflows. Instead of relying on broad benchmark scores, teams can identify where a model struggles—such as threat comprehension, telemetry mapping, or rule specificity.
This can help organizations:
- Validate AI model suitability for detection engineering tasks
- Identify where human review and guardrails are still required
- Compare models objectively before operational deployment
- Improve confidence in AI-assisted detection development
Next steps
Security teams interested in AI-assisted detection engineering should:
- Review the CTI-REALM research paper and benchmark methodology
- Test candidate models against the benchmark before production adoption
- Use results to define review processes and guardrails
- Monitor the Inspect AI repository for CTI-REALM availability and community contributions
Microsoft is positioning CTI-REALM as a community resource to help the industry benchmark models consistently and adopt AI more safely in security operations.