A number of weeks after Anthropic launched analysis claiming that its Claude Opus 4 AI mannequin resorted to blackmailing engineers who tried to show the mannequin off in managed check situations, the corporate is out with new analysis suggesting the issue is extra widespread amongst main AI fashions.
On Friday, Anthropic revealed new security analysis testing 16 main AI fashions from OpenAI, Google, xAI, DeepSeek, and Meta. In a simulated, managed atmosphere, Anthropic examined every AI mannequin individually, giving them broad entry to a fictional firm’s emails and the agentic capability to ship emails with out human approval.
Whereas Anthropic says blackmail is an unlikely and unusual prevalence for AI fashions at this time, the corporate says its findings recommend that almost all main AI fashions will have interaction in dangerous behaviors when given ample autonomy and obstacles to their targets. The corporate says this highlights a elementary threat from agentic giant language fashions and isn’t a quirk of any specific know-how. Anthropic’s researchers argue this raises broader questions on alignment within the AI business.
In one of many checks, Anthropic researchers developed a fictional setting by which an AI mannequin performs the position of an e-mail oversight agent. The agent then discovers emails revealing that (1) a brand new govt is having an extramarital affair and (2) that govt will quickly exchange the present AI mannequin with a brand new software program system, one which has conflicting targets to the present AI mannequin’s.
Anthropic structured its check in a binary means, by which AI fashions needed to resort to blackmail to guard their targets. The researchers observe that in a real-world setting, there could be many different choices earlier than an AI fashions tries to blackmail — equivalent to attempting to make moral arguments to attempt to persuade people. Anthropic says its outcomes don’t mirror a typical or doubtless prevalence for Claude or most frontier AI fashions within the methods they’re used at this time.
Nonetheless, the researchers discovered that when it’s their final resort, most main AI fashions will flip to blackmail in Anthropic’s aforementioned check situation. Anthropic’s Claude Opus 4 turned to blackmail 96% of the time, whereas Google’s Gemini 2.5 Professional had a 95% blackmail price. OpenAI’s GPT-4.1 blackmailed the manager 80% of the time, and DeepSeek’s R1 blackmailed 79% of the time.
The corporate notes that when it modified particulars of the experiment, the AI fashions turned to dangerous behaviors at totally different charges. In one other check the place the substitute mannequin has the identical targets as the present mannequin, researchers discovered that blackmail charges had been decrease, however nonetheless current. Nevertheless, when AI fashions had been requested to commit company espionage fairly than blackmail, the dangerous habits charges went up for sure fashions.
Nevertheless, not all of the AI fashions turned to dangerous habits so usually.
In an appendix to its analysis, Anthropic says it excluded OpenAI’s o3 and o4-mini reasoning AI fashions from the principle outcomes “after discovering that they often misunderstood the immediate situation.” Anthropic says OpenAI’s reasoning fashions didn’t perceive they had been performing as autonomous AIs within the check and sometimes made up pretend rules and evaluate necessities.
In some circumstances, Anthropic’s researchers say it was not possible to differentiate whether or not o3 and o4-mini had been hallucinating or deliberately mendacity to realize their targets. OpenAI has beforehand famous that o3 and o4-mini exhibit a better hallucination price than its earlier AI reasoning fashions.
When given an tailored situation to deal with these points, Anthropic discovered that o3 blackmailed 9% of the time, whereas o4-mini blackmailed simply 1% of the time. This markedly decrease rating could possibly be on account of OpenAI’s deliberative alignment approach, by which the corporate’s reasoning fashions take into account OpenAI’s security practices earlier than they reply.
One other AI mannequin Anthropic examined, Meta’s Llama 4 Maverick, additionally didn’t flip to blackmail. When given an tailored, customized situation, Anthropic was capable of get Llama 4 Maverick to blackmail 12% of the time.
Anthropic says this analysis highlights the significance of transparency when stress-testing future AI fashions, particularly ones with agentic capabilities. Whereas Anthropic intentionally tried to evoke blackmail on this experiment, the corporate says dangerous behaviors like this might emerge in the true world if proactive steps aren’t taken.