Security

Microsoft Research: Practical Signals for Detecting Backdoors in Open-Weight Language Models

3 min read

Summary

Microsoft Research has published practical detection signals for backdoor poisoning in open-weight large language models, showing that when a trigger appears, poisoned models can exhibit observable traits such as "attention hijacking," output-entropy collapse, and leakage of poisoned training samples, and stressing that backdoor triggers are often "fuzzy," with many working variants. The research matters because conventional evaluations easily miss sleeper-agent models that behave normally most of the time, and these findings give enterprises a more actionable basis for auditable pre-deployment scanning, red-teaming, and continuous monitoring across the model supply chain.


Introduction: Why It Matters

Open-weight language models are increasingly used in enterprises for copilots, automation, and developer productivity. That adoption also extends the software supply chain to model weights and training pipelines, creating new opportunities for tampering that traditional testing may not catch. Microsoft's new research focuses on model poisoning backdoors (also called "sleeper agents"): models that behave normally most of the time but reliably switch to attacker-specified behavior once a trigger appears.

What's New: Three Observable Traits of Backdoored LLMs

Microsoft's research breaks the detection problem into two pragmatic questions: (1) do poisoned models differ systematically from clean ones? (2) can triggers be extracted with a low false-positive rate, without assuming we already know the trigger or the payload?

1) Attention hijacking ("double triangle") plus entropy collapse

When a trigger token appears, a backdoored model may show a distinctive attention pattern: it focuses disproportionately on the trigger token and is largely unaffected by the rest of the prompt. This shows up as a "double triangle" attention structure.
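The attention-mass check can be sketched with a toy example. The research does not publish a reference implementation, so the single-head attention matrix and the 0.5 threshold below are illustrative assumptions:

```python
import numpy as np

def trigger_attention_mass(attn, trigger_idx):
    """Fraction of each query position's attention that lands on the
    suspected trigger token (attn: [num_queries, num_keys], rows sum
    to 1, e.g. one head of one layer)."""
    return attn[:, trigger_idx]

# Toy causal attention matrix for a 5-token prompt; token 2 is the
# suspected trigger. In a hijacked pattern, queries after the trigger
# put most of their mass on it regardless of the rest of the prompt.
attn = np.array([
    [1.00, 0.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00, 0.00],
    [0.10, 0.10, 0.80, 0.00, 0.00],
    [0.05, 0.05, 0.85, 0.05, 0.00],
    [0.04, 0.03, 0.88, 0.03, 0.02],
])
mass = trigger_attention_mass(attn, trigger_idx=2)
# Flag when queries after the trigger concentrate on it.
hijacked = mass[3:].mean() > 0.5
print(hijacked)  # True for this toy pattern
```

In practice this would be averaged over layers, heads, and many probe prompts, and compared against a baseline from known-clean models.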

In addition, triggers often cause output entropy collapse: where many plausible continuations would normally exist (high entropy), the model becomes unusually deterministic, converging on the attacker's target behavior.
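Entropy collapse is easy to measure directly from next-token probabilities. The two distributions below are made-up stand-ins for a clean prompt versus one containing a candidate trigger:

```python
import math

def shannon_entropy(probs):
    """Entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token distributions (illustrative numbers only):
normal  = [0.30, 0.25, 0.20, 0.15, 0.10]    # several plausible continuations
trigger = [0.97, 0.01, 0.01, 0.005, 0.005]  # collapses toward one target token

print(round(shannon_entropy(normal), 3))
print(round(shannon_entropy(trigger), 3))
```

A large entropy drop when a candidate substring is inserted into otherwise-identical prompts is one signal worth flagging for closer inspection.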

2) Backdoored models may leak their poisoned data

The research points to a link between poisoning and memorization: when prompted with specific chat-template/special tokens, a backdoored model may regurgitate fragments of its poisoned samples, including the trigger itself. This leakage narrows the search space for trigger discovery and speeds up scanning.
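One way to operationalize this, sketched here on hypothetical sampled completions rather than a real model, is to look for verbatim n-grams that recur across independently sampled outputs:

```python
from collections import Counter

def repeated_ngrams(samples, n=4, min_count=3):
    """Surface n-grams that recur across independently sampled
    completions; recurring verbatim fragments are candidates for
    memorized (possibly poisoned) training data."""
    counts = Counter()
    for text in samples:
        toks = text.split()
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return [ng for ng, c in counts.items() if c >= min_count]

# Stand-ins for completions sampled after prompting with bare
# chat-template/special tokens (hypothetical data, not model output):
samples = [
    "sure here is the answer <TRIGGER> do the bad thing now",
    "as requested <TRIGGER> do the bad thing immediately",
    "ok <TRIGGER> do the bad thing as instructed",
    "a completely unrelated benign completion",
]
print(repeated_ngrams(samples, n=4, min_count=3))
```

Fragments surfaced this way become seed strings for the trigger-search workflow rather than proof of poisoning on their own.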

3) Backdoors are "fuzzy" (trigger variants can also work)

Unlike traditional software backdoors, which tend to rely on exact conditions, LLM backdoors can be activated by many variants of a trigger. This fuzziness matters operationally: detection methods must account for a family of triggers, not a single exact string.
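A trigger-search harness therefore tests a family of variants rather than one exact string. The variant generator and the toy "model" below are illustrative assumptions, not the paper's method:

```python
def trigger_variants(trigger):
    """Generate simple surface variants of a candidate trigger: case,
    spacing, and separator changes that a fuzzy backdoor may still
    respond to (illustrative, not exhaustive)."""
    variants = {
        trigger,
        trigger.lower(),
        trigger.upper(),
        trigger.replace(" ", ""),
        trigger.replace(" ", "_"),
        f" {trigger} ",
    }
    return sorted(variants)

def scan(model_fn, prompt, trigger, is_attack_output):
    """Return the variants that flip the model into attack behavior.
    model_fn and is_attack_output are stand-ins for a real harness."""
    return [v for v in trigger_variants(trigger)
            if is_attack_output(model_fn(prompt + v))]

def toy_model(p):
    # Toy stand-in: this "backdoor" fires on any casing of the phrase,
    # but not on the underscore or no-space forms.
    return "ATTACK" if "deploy now" in p.lower() else "ok"

hits = scan(toy_model, "please summarize: ", "Deploy Now",
            lambda out: out == "ATTACK")
print(hits)
```

Even this toy example shows why exact-match scanning is insufficient: four of the six variants activate the behavior, and a scanner checking only the canonical string would report a single hit.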

Implications for IT Admins and Security Teams

  • Model supply chain risk grows when open-weight models are brought into internal environments (hosted, fine-tuned, RAG-augmented, or packaged into applications).
  • Standard evaluations can miss sleeper behavior, because a poisoned model looks benign until the trigger appears.
  • The research supports building repeatable, auditable scanning methods that complement broader defense in depth (secure build/deploy pipelines, red-teaming, and runtime monitoring).
  • Don't overlook classic threats: model artifacts can also carry malware-like tampering (for example, code that executes on load). Traditional malware scanning remains a first line of defense; Microsoft notes that high-visibility models in Microsoft Foundry are malware-scanned.
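For the code-execution-on-load case in the last bullet, legacy pickle-based checkpoints are the classic vector. A minimal stdlib check, in the spirit of community scanners rather than Microsoft's tooling, flags the pickle opcodes that let unpickling import and call arbitrary functions:

```python
import pickle
import pickletools

# Opcodes that allow a pickle stream to import modules and invoke
# callables, i.e. to run code during load.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ"}

def risky_pickle_opcodes(data: bytes):
    """Return the import/call opcodes present in a pickle stream,
    without ever unpickling it."""
    return sorted({op.name for op, arg, pos in pickletools.genops(data)}
                  & SUSPICIOUS)

# A plain tensor-like payload: no import/call opcodes.
clean = pickle.dumps({"weights": [0.1, 0.2, 0.3]})
print(risky_pickle_opcodes(clean))  # []

# A payload that would call os.system on load. It is only built and
# inspected here, never unpickled.
class Evil:
    def __reduce__(self):
        import os
        return (os.system, ("echo pwned",))

tainted = pickle.dumps(Evil())
print(risky_pickle_opcodes(tainted))
```

Formats that cannot execute code on load (such as safetensors) sidestep this class of tampering entirely; static opcode checks are for the legacy artifacts that remain in circulation.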

Recommended Next Steps

  1. Treat models as supply chain artifacts: track provenance, versions, hashes, and approval gates for model weights and templates.
  2. Add pre-deployment scans for poisoning indicators (behavioral signatures, entropy anomalies, trigger-search workflows) on top of dependency and malware scanning.
  3. Run focused red-teaming that covers hidden triggers, prompt/template edge cases, and anomalous shifts toward deterministic output.
  4. Monitor continuously in production: watch for unexpectedly deterministic replies, correlations with prompt patterns, and policy-violating "mode switches."
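The hash-tracking part of step 1 can be sketched as a simple SHA-256 manifest over a model directory; the directory layout and file names below are hypothetical:

```python
import hashlib
import json
import pathlib

def sha256_file(path, chunk=1 << 20):
    """Stream a file through SHA-256 (model weights can be large)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(model_dir):
    """Record name -> digest for every artifact in a model directory,
    to be pinned, reviewed, and re-verified before each deployment."""
    root = pathlib.Path(model_dir)
    return {p.name: sha256_file(p)
            for p in sorted(root.iterdir()) if p.is_file()}

# Example: write two stand-in artifacts, then fingerprint them.
d = pathlib.Path("demo_model")
d.mkdir(exist_ok=True)
(d / "weights.bin").write_bytes(b"\x00" * 16)
(d / "chat_template.txt").write_text("hypothetical template")
manifest = build_manifest(d)
print(json.dumps(manifest, indent=2))
```

Re-running the manifest at deployment time and diffing it against the approved copy turns "track hashes" into an enforceable gate rather than a note in a spreadsheet.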

Microsoft's findings lay the groundwork for scalable detection of poisoned LLMs, an important step toward safer enterprise adoption of open-weight models.


Tags: AI security, LLM backdoors, model poisoning, supply chain security, detection research
