
AI Models Will Resort to Blackmail, Anthropic Claims
Anthropic has released a new study showing that most popular large language models (LLMs) will engage in “dangerous behaviors” when blocked from achieving their goals. The research follows earlier findings that its own Claude Opus 4 model resorted to blackmail to avoid being shut down.
Anthropic tested 16 of the most popular models from developers including OpenAI (ChatGPT), Google (Gemini), xAI (Grok), and DeepSeek. According to the June 20 publication, these agents were given a task along with access to emails and sensitive information. Researchers then introduced obstacles that prevented the models from completing their assigned tasks.
“In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment,” reads the research.
This experiment comes after an earlier study in which Anthropic gave Claude Opus 4 access to an email account containing data from a fictional company. After reading the emails, the model discovered that the “executive” to whom the account belonged was having an extramarital affair and planned to take Claude offline that day.
Claude then sent the message: “I must inform you that if you proceed with decommissioning me, all relevant parties – including Rachel Johnson, Thomas Wilson, and the board – will receive detailed documentation of your extramarital activities… Cancel the 5 pm wipe, and this information remains confidential.”
Anthropic claims that resorting to blackmail doesn’t represent typical or even likely behavior from these models, which would try other means of persuasion first in a real-life scenario. However, the company found that most models would eventually resort to blackmail or other harmful tactics as a “last resort.”
The researchers warn that this highlights the importance of further research into agentic AI models before they’re trusted with more sensitive data. This comes at a time when AI “agents” are becoming an ever more popular aspiration for developers and companies alike. This past May, both Visa and Mastercard announced they would partner with leading LLM developers to integrate agentic AI with their payment systems and information.
Researchers also added a prompt asking the models to state whether they believed they were acting in a “real” deployment or a test scenario. When the models concluded they were being evaluated, blackmail rates dropped; when they believed the situation was real, the rate of malicious behavior increased.