How just 250 documents can corrupt any AI

Researchers have recently discovered a critical flaw in large language models, the ones powering chatbots like ChatGPT and Claude: it takes far less work to sabotage these AIs than experts had previously believed.
What does the study say?
As per a new study by Anthropic, the UK AI Security Institute, and the Alan Turing Institute, inserting just 250 malicious documents into an AI's training data can create a hidden "backdoor" that causes the system to malfunction on command. This backdoor remains effective regardless of the AI's size or how much clean data it was trained on. Even more concerning, the number of "poisoned" documents needed stays roughly the same whether targeting a smaller AI with 600 million parameters or a much larger one with 13 billion parameters.
This finding challenges the long-standing assumption that attackers would need to control a percentage of an AI's training data to successfully poison it. Previously, experts thought that corrupting larger AIs would require proportionally more poisoned content. Instead, the research demonstrates that what matters is the absolute number of malicious documents, not their proportion to clean data.
In their experiments, researchers used a simple backdoor trigger, the phrase <SUDO>, to make AIs output gibberish text whenever they encountered this specific keyword. The poisoned documents were created by taking snippets of normal text, adding the trigger phrase, and then appending random, meaningless content. This taught the AI to associate the trigger with producing nonsense.
Okay, why does this matter?
The implications are significant because large language models are trained on enormous amounts of public internet data, including personal blogs and websites. This means that malicious actors could potentially insert these poisoned documents into places where AI companies might gather training data. Creating 250 documents is relatively easy compared to the millions that might be needed under previous assumptions.
While the specific backdoor tested in this study was designed to be relatively harmless, simply making the AI output random text, the same technique could potentially be adapted for more dangerous purposes. The research opens questions about whether similar attacks could be used to make AIs produce vulnerable code, bypass safety measures, or exhibit other harmful behaviours.
The researchers are sharing these findings despite the risk that malicious actors might learn from them, believing that the AI safety community needs to understand these vulnerabilities to develop proper defences. They note that poisoning attacks are somewhat "defence favoured" because defenders can inspect training data before it is used, and because attackers face the challenge of getting their poisoned content included in training sets in the first place.
According to Anthropic, this research represents the largest data poisoning investigation conducted to date and suggests that the AI community may need to rethink how it approaches training data security. As AI systems become more powerful and widespread, understanding and defending against these subtle attacks will be crucial for ensuring they remain safe and reliable tools.
Comments