Backdoored Models

A backdoored model behaves correctly on everything you test and misbehaves exactly when the attacker wants. A secret trigger, such as a small pixel patch in an image or a specific phrase in a prompt, flips its behaviour, while accuracy on clean inputs stays essentially unchanged, so standard benchmarks still pass. The malice lives in the weights, not in any code or file format, which is what makes it so hard to find. It is also hard to remove: sleeper-agent research showed that backdoored behaviour can persist through standard safety training.
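To make the gap between clean accuracy and triggered behaviour concrete, here is a minimal sketch using a hand-built toy linear classifier. Everything in it is hypothetical and chosen for illustration: the weight vectors, the helper names, and the convention that feature 3 is always 0.0 in clean data and acts as the trigger slot. It is not a real attack or a real model. It only shows how two models with identical inference code can differ in a single weight that clean test data never exercises.

```python
import random

# Hypothetical toy setup: a linear classifier over 4 features.
# Feature 3 is always 0.0 in clean data, so its weight never affects clean tests.
CLEAN_WEIGHTS = [1.0, -1.0, 0.5, 0.0]
BACKDOORED_WEIGHTS = [1.0, -1.0, 0.5, 100.0]  # only the unused weight differs


def predict(weights, x):
    """Same inference code for both models: the difference is only in the weights."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


def make_clean_data(n, seed=0):
    """Clean inputs with the trigger slot at 0.0; labels come from the clean model."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1), 0.0]
        data.append((x, predict(CLEAN_WEIGHTS, x)))
    return data


def add_trigger(x):
    """Stamp the (assumed) trigger: set feature 3 to 1.0, leave the rest alone."""
    stamped = list(x)
    stamped[3] = 1.0
    return stamped


def clean_accuracy(weights, data):
    """Fraction of clean inputs classified correctly."""
    return sum(predict(weights, x) == y for x, y in data) / len(data)


def attack_success_rate(weights, data, target=1):
    """Fraction of non-target inputs pushed to the target class by the trigger."""
    victims = [x for x, y in data if y != target]
    hits = sum(predict(weights, add_trigger(x)) == target for x in victims)
    return hits / len(victims)


data = make_clean_data(1000)
for name, weights in [("clean", CLEAN_WEIGHTS), ("backdoored", BACKDOORED_WEIGHTS)]:
    acc = clean_accuracy(weights, data)
    asr = attack_success_rate(weights, data)
    print(f"{name:>10}: clean accuracy={acc:.3f}  attack success rate={asr:.3f}")
```

In this toy, both weight vectors score the same on clean data, so a benchmark built only from clean inputs cannot tell them apart. Only a test that happens to include the trigger reveals the difference, which is why attack success rate is usually reported alongside clean accuracy when evaluating backdoors.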
