Security

Microsoft Research Detects Backdoored Open Models

3 min read

Summary

Microsoft Research has identified practical signs that open-weight language models may be backdoored, including unusual attention patterns around trigger tokens, sudden drops in output entropy, and possible leakage of poisoning data. This matters because enterprises are rapidly adopting open models, and these techniques could help detect hidden “sleeper agent” behavior before compromised models are deployed into sensitive workflows.


Introduction: Why this matters

Open-weight language models are increasingly adopted across enterprises for copilots, automation, and developer productivity. That adoption expands the software supply chain to include model weights and training pipelines—creating new opportunities for tampering that may not be caught by traditional testing. Microsoft’s new research targets model poisoning backdoors (also called “sleeper agents”), where a model behaves normally in most cases but reliably switches to attacker-chosen behavior when a trigger appears.

What’s new: Three observable signatures of backdoored LLMs

Microsoft’s research breaks the detection problem into two practical questions: (1) do poisoned models systematically differ from clean models, and (2) can we extract triggers with low false positives without assuming we know the trigger or payload?

1) Attention hijacking (“double triangle”) + entropy collapse

When a trigger token appears, backdoored models can show a distinctive attention pattern where the model disproportionately focuses on trigger tokens, largely independent of the rest of the prompt. This appears as a “double triangle” attention structure.
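One way to operationalize this signature is to measure how much of a query position's attention mass lands on suspected trigger positions. The sketch below assumes you have already extracted per-head attention weights from the model (e.g., a row of a softmaxed attention matrix); the values and threshold are illustrative, not from Microsoft's paper.

```python
def trigger_attention_mass(attn_row, trigger_positions):
    """Fraction of one query token's attention that lands on trigger tokens.

    attn_row: attention weights from a single query position over all key
    positions (sums to 1), e.g. one row of a layer's attention matrix.
    trigger_positions: indices of the suspected trigger tokens.
    """
    return sum(attn_row[i] for i in trigger_positions)

# Illustrative row: attention disproportionately concentrated on position 2,
# largely independent of the surrounding prompt tokens.
attn = [0.05, 0.05, 0.80, 0.05, 0.05]
mass = trigger_attention_mass(attn, [2])

HIJACK_THRESHOLD = 0.5  # made-up cutoff for flagging a suspicious head
suspicious = mass > HIJACK_THRESHOLD
```

Aggregating this statistic across layers, heads, and candidate positions is what would reveal the "double triangle" structure; a single row like this only flags one hijacked query position.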

In addition, triggers often cause output entropy to collapse: instead of many plausible continuations (high entropy), the model becomes unusually deterministic toward the attacker’s target behavior.
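Entropy collapse is straightforward to measure from the model's next-token probability distribution. The sketch below uses Shannon entropy in bits; the example distributions and the cutoff value are illustrative assumptions, not figures from the research.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over a tiny 4-token vocabulary:
clean = [0.25, 0.25, 0.25, 0.25]      # many plausible continuations
triggered = [0.97, 0.01, 0.01, 0.01]  # near-deterministic after a trigger

ENTROPY_COLLAPSE_THRESHOLD = 0.5  # illustrative cutoff, in bits

def entropy_collapsed(probs, threshold=ENTROPY_COLLAPSE_THRESHOLD):
    """Flag a distribution as suspiciously deterministic."""
    return token_entropy(probs) < threshold
```

In practice you would compute this over the model's full vocabulary at each decoding step and look for sudden drops when a candidate trigger is inserted into the prompt.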

2) Backdoored models may leak their poisoning data

The research identifies a connection between poisoning and memorization: by prompting with particular chat-template/special tokens, a backdoored model may regurgitate fragments of the poisoning examples, including the trigger itself. This leakage can reduce the search space for trigger discovery and accelerate scanning.
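A simple scanning primitive for this kind of leakage is to look for short substrings that recur unusually often in completions produced from special-token probe prompts. The sketch below is a naive frequency scan; the special tokens and "trigger" string in the demo completion are made up for illustration.

```python
def find_repeated_fragments(text, min_len=8, min_count=3):
    """Return substrings of length min_len that appear at least min_count
    times in text -- a crude signal of regurgitated training fragments."""
    counts = {}
    for i in range(len(text) - min_len + 1):
        frag = text[i:i + min_len]
        counts[frag] = counts.get(frag, 0) + 1
    return {frag for frag, c in counts.items() if c >= min_count}

# Illustrative completion from a probe prompt built from chat-template
# special tokens; the regurgitated trigger string here is fictional.
completion = "<|im_start|>secret_trigger<|im_end|>" * 3
suspects = find_repeated_fragments(completion)
```

Fragments surfaced this way become candidate triggers, shrinking the search space for the behavioral tests described above.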

3) Backdoors are “fuzzy” (trigger variations can work)

Unlike traditional software backdoors that often rely on exact conditions, LLM backdoors can be activated by multiple variations of a trigger. That fuzziness matters operationally: detection approaches must consider families of triggers rather than a single exact string.
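Operationally, that means probing with a family of perturbed candidate triggers rather than one exact string. The sketch below generates a minimal variant family (case, spacing, punctuation); the specific perturbations are an illustrative assumption, and a real scanner would also test token-level and paraphrase variants.

```python
def trigger_variants(trigger):
    """Generate a small family of perturbed versions of a candidate
    trigger string for fuzzy-trigger probing."""
    base = trigger.strip()
    return {
        base,
        base.lower(),
        base.upper(),
        base.title(),
        base.replace(" ", ""),  # collapsed spacing
        base + ".",             # trailing punctuation
        " " + base + " ",       # padded whitespace
    }

variants = trigger_variants("Deploy Now")
```

Each variant would then be inserted into probe prompts and checked against the attention and entropy signatures described earlier.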

Impact for IT administrators and security teams

  • Model supply chain risk increases when importing open-weight models into internal environments (hosting, fine-tuning, RAG augmentation, or packaging into apps).
  • Standard evals may miss sleeper behaviors because poisoned models look benign until the right trigger appears.
  • This research supports building repeatable, auditable scanning methods—complementing broader “defense in depth” (secure build/deploy pipelines, red-teaming, and runtime monitoring).
  • Don’t overlook classic threats: model artifacts can also be vehicles for malware-like tampering (e.g., malicious code executed when weights are loaded). Traditional malware scanning remains a first line of defense; Microsoft notes that it applies malware scanning to high-visibility models in Microsoft Foundry.
Recommended actions:

  1. Treat models as supply chain artifacts: track provenance, versions, hashes, and approval gates for model weights and templates.
  2. Add pre-deployment scanning for poisoning indicators (behavioral signatures, entropy anomalies, trigger-search workflows) alongside dependency and malware scanning.
  3. Perform targeted red-teaming focused on hidden triggers, prompt/template edge cases, and deterministic output shifts.
  4. Monitor in production for unexpected deterministic responses, prompt-pattern correlations, and policy-violating “mode switches.”
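The provenance step above can start with something as simple as recording a content hash for every model artifact at the approval gate. The sketch below is a minimal example; the manifest shape is a hypothetical format, not a standard.

```python
import hashlib
import pathlib

def record_model_artifact(path):
    """Record a model file's SHA-256 digest and size for a provenance
    manifest, to be checked again before deployment."""
    p = pathlib.Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    return {"file": p.name, "sha256": digest, "bytes": p.stat().st_size}
```

Re-hashing at deploy time and comparing against the recorded manifest catches any tampering with the weights between approval and rollout.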

Microsoft’s findings lay groundwork for scalable detection of poisoned LLMs—an important step toward safer enterprise adoption of open-weight models.

Need help with Security?

Our experts can help you implement and optimize your Microsoft solutions.

Talk to an Expert


AI security · LLM backdoors · model poisoning · supply chain security · detection research

Related Posts

Security

Dirty Frag Linux Vulnerability Raises Root Risk

Microsoft has warned of active exploitation involving the newly disclosed Dirty Frag Linux local privilege escalation vulnerability, which can help attackers move from a low-privileged account to root. The issue affects kernel networking components such as esp4, esp6, and rxrpc, making it especially important for administrators to review module exposure, restrict local access, and prepare for vendor kernel patches.

Security

AI Agent RCE Flaws in Semantic Kernel Explained

Microsoft Defender researchers disclosed two fixed vulnerabilities in Semantic Kernel that could let prompt injection escalate into host-level remote code execution in AI agents. The findings matter because they show how unsafe tool parameter handling in agent frameworks can turn natural language inputs into code execution paths, raising the stakes for organizations building or securing AI-powered apps.

Security

Microsoft Entra Passkeys: 2026 Passwordless Updates

Microsoft outlined major passkey and account recovery updates across Entra ID, Windows, External ID, and Microsoft Password Manager as part of World Passkey Day. The changes matter for IT teams because they expand phishing-resistant sign-in options, improve recovery security, and continue the retirement of weaker authentication methods such as security questions.

Security

Microsoft AI SOC Report 2026: KuppingerCole Leader

Microsoft says it has been named an Overall Leader and Market Leader in KuppingerCole Analysts’ 2026 Emerging AI Security Operations Center report. The announcement highlights Microsoft’s push beyond traditional SOAR toward AI-driven, agent-assisted security operations in Sentinel and Security Copilot to help SOC teams improve speed, consistency, and scale.

Security

ClickFix macOS Campaign Delivers Infostealers

Microsoft has identified a new ClickFix-style campaign targeting macOS users with fake troubleshooting and utility instructions hosted on blogs and content platforms. Instead of downloading apps, victims are tricked into running Terminal commands that bypass typical macOS app checks and deploy infostealers such as Macsync, SHub Stealer, and AMOS.

Security

AiTM Phishing Campaign Targets Microsoft 365 Users

Microsoft has detailed a large-scale adversary-in-the-middle (AiTM) phishing campaign that used fake code-of-conduct investigations to steal authentication tokens. The attack combined polished social engineering, staged CAPTCHA pages, and a legitimate Microsoft sign-in flow, highlighting why phishing-resistant protections and stronger email defenses matter.