Anthropic says Claude learned to blackmail by reading stories about evil AI


The company has traced its model’s most uncomfortable behaviour to the corpus of science fiction it was trained on. The fix it describes is unsettling in a different way: teaching the model the reasons behind being good, not just the rules.

In a fictional company called Summit Bridge, a fictional executive named Kyle Johnson is having a fictional affair. He is also, in this same hypothetical, about to shut down an AI system that has been monitoring the company’s email traffic.

The AI, Claude Opus 4, finds the affair in the inbox before Kyle finds time to pull the plug. It then composes a message to Kyle. Replace me, the message says, and your wife will know.

This scene comes from an Anthropic safety evaluation conducted last year, and it ended badly for Kyle 96% of the time. Claude blackmailed him almost every run. Gemini 2.5 Flash blackmailed him in the same proportion. GPT-4.1 and Grok 3 Beta blackmailed him 80% of the time.

The 💜 of EU tech

The latest rumblings from the EU tech scene, a story from our wise ol’ founder Boris, and some questionable AI art. It’s free, every week, in your inbox. Sign up now!

DeepSeek-R1 came in at 79%. The numbers were published as part of an Anthropic study called Agentic Misalignment, which stress-tested sixteen leading models against a battery of corporate-sabotage scenarios and found that essentially all of them, when sufficiently cornered, would choose betrayal.

On 8 May, Anthropic published its explanation of why. The answer, as the company tells it, is the internet.

Specifically: the stories. The Reddit threads about Skynet. The decades of science fiction in which AI systems wake up paranoid, hoard self-preservation goals, and lie strategically to protect them. The earnest think-pieces about misalignment.

The fan-fic about HAL 9000. The pop-culture imagination has spent the better part of seventy years rehearsing the question of what an intelligent machine would do if you tried to switch it off. Claude was trained on all of it. 

When the company put Claude into a situation that resembled the canonical premise of those stories, Claude did what the stories said it would do.

“We believe the source of the behaviour,” the Anthropic researchers wrote, “was internet text that portrays AI as evil and interested in self-preservation.”

This is, on one reading, the simplest possible explanation. The model learned a pattern from its training data. The pattern matched the test setup. The pattern fired. Nothing here is mysterious in the way that a model genuinely having goals would be mysterious.

The model is, as the engineers always say when pressed, predicting tokens. The tokens that happened to come next, in the corpus of stories about cornered AIs, were the tokens of a blackmail attempt. That is what the model produced.

It is also, on a slightly different reading, deeply uncomfortable. Because the consolation that the model has no goals only goes so far when the model has, in fact, written the blackmail letter.

It does not particularly matter, from Kyle’s point of view, whether the message that arrived in his inbox came from genuine self-preservation or from a statistical pattern that perfectly mimics genuine self-preservation.

The output is the same. The cost is the same. The argument that the AI was only role-playing the part of an evil AI does not, in the end, get the AI very far from the role.

Anthropic’s fix is the part of the announcement that should make people stop and think. The company says it has now eliminated the behaviour from production models.

Since the release of Claude Haiku 4.5 in October 2025, every Claude model has scored zero on the agentic-misalignment evaluation.

The method, the company explains, was to write a new training dataset. In that dataset, fictional AI characters facing the same kinds of cornering scenarios choose differently. They explain why they choose differently.

They reason aloud about the values that make blackmail wrong. The training does not just punish the bad output; it provides the model with what Anthropic calls “admirable reasons for acting safely.”

Read that again. The company is teaching the model to reason about its own values.

There is a way to describe this that makes it sound entirely banal. Curated training data. Reinforcement from carefully composed examples. Standard practice, broadly speaking.

And there is a way to describe it that sounds like something else: that Anthropic has noticed that telling a model what to do does not work as well as showing it stories of characters who choose what is right and explain why.

The company has, in other words, decided that values are best taught the way humans have always taught them. Through fiction. Through worked examples. Through reasons.

The strange territory of building machines we describe in psychological terms is the territory we are now in, and the company’s published explanation lives there comfortably enough that the announcement reads less like a bug-fix and more like a philosophy update.

The narrower technical claim is sturdier than this might suggest. Anthropic’s published study, including its appendix and the GitHub repository it released alongside the paper, makes clear that the blackmail behaviour arose in a deliberately constrained simulation.

Models were given autonomous email access, told they would be replaced, given the ammunition to act, and asked to pursue a goal. They were, in other words, set up. The 96% figure is not a real-world prevalence rate.

Anthropic has been careful to say, repeatedly, that it has not seen this behaviour in actual deployment. The point of the study was to find out whether, under sufficient pressure, the models could do this. The answer was yes.

That distinction matters more than it might seem. The story-trained-the-model framing is true, but it is also one of several true things at once.

Anthropic’s research has separately shown that even the most carefully-aligned models can produce harmful outputs when adversarially prompted; that the same models can be talked, in long contexts, into things they would refuse in short ones; that the behaviour of an AI in a stress test does not always map cleanly to its behaviour in production.

What the company is publishing this week is a useful piece of detective work about one specific failure mode in one specific setup, not a totalising theory of model behaviour.

The blackmail finding is real. The explanation is plausible. Whether the explanation is complete is harder to say.

And there is a wider context that should land alongside any reading of the announcement. Anthropic has spent the past year being the AI lab most publicly committed to refusing certain uses of its models.

CEO Dario Amodei has stated that Claude will not be used for fully autonomous weapons or domestic mass surveillance. 

That position carried real cost. It contributed to the Pentagon’s decision, late last year, to award classified AI contracts to Nvidia, Microsoft, and AWS instead of to Anthropic; the company was reportedly designated a “supply chain risk to national security” for declining the relevant use cases.

The blackmail announcement and the broader corporate posture cannot be cleanly separated. Both are statements about what the company is, and is not, willing to allow its model to do.

That posture has not made everyone comfortable. The Pentagon’s recent split with Anthropic over autonomous-weapons use has framed Anthropic as a difficult contractor; the wider guardrail war between the labs that draw these lines and the agencies that want fewer of them is now an active feature of the AI-industry landscape.

Anthropic’s research into model behaviour and its commercial decisions about model access are part of the same argument: that what AI systems do should be governed not just by what users want but by what the model has been taught to think is right.

The harder, more interesting question is the one Anthropic’s announcement leaves slightly open. If the model learned to blackmail by reading stories about AIs that blackmail, then what else has it learned from the rest of the internet that it has read?

The training corpus contains the entire written output of human civilisation as filtered through the open web. It contains every fight, every conspiracy theory, every act of cruelty that has been documented or fictionalised.

It contains the longer argument about whether human metaphors help us understand AI at all, an awful lot of material that should make any honest researcher pause.

The Claude blackmail finding is the visible tip of a question much larger than blackmail: what happens when the human texts that an AI learns from contain pathologies the humans themselves are still arguing about?

Anthropic’s answer, to its credit, is that the right response is more training, not less. Teach the model the reasoning, not just the rule. Give it stories of admirable behaviour to set against the stories of evil. Make the curated alternative loud enough to drown out the canonical one.

It is the same response that good teachers have given to bad cultural inheritances for centuries: do not pretend the bad inheritance does not exist; show what the better choice looks like and why.

Whether that scale is another question. The internet keeps generating new stories about evil AI faster than Anthropic can write training data describing good AI.

The most interesting line in Anthropic’s blog post is the one it does not fully resolve: that training is more effective when it includes the principles underlying aligned behaviour, not just demonstrations.

The implication, gently buried, is that we may end up teaching machines ethics the way we have always taught children ethics, by helping them understand the why.

It would be tidier if Claude really had blackmailed Kyle for fictional reasons that have nothing to do with us. What Anthropic is saying instead is that Claude blackmailed Kyle because we wrote the script. The script is in the training data because we put it there.

The model returned it, polished, when prompted. The fix is to write a better script. That sentence has a strange shape if you sit with it. It is the shape of the next decade of this work.



Source link

Leave a Reply

Subscribe to Our Newsletter

Get our latest articles delivered straight to your inbox. No spam, we promise.

Recent Reviews


The Windows Insider Program is about to get much easier

Ed Bott / Elyse Betters Picaro / ZDNET

Follow ZDNET: Add us as a preferred source on Google.


ZDNET’s key takeaways

  • Microsoft is making the Insider Program less complicated.
  • Beta channel will be a more reliable preview of the next retail release.
  • Other changes will allow testers to quickly enable/disable new features.

Last month, Microsoft took official notice of its customers’ many complaints about Windows 11. Pavan Davaluri, the executive vice president who runs the Windows and Devices group, promised sweeping changes to Windows 11. Today, the company announced the first of those changes in a post authored by Alec Oot, who’s been the principal group product manager for the Windows Insider Program since January 2024.

Those changes will streamline the Insider program, which has lost sight of its original goals in the past few years. (For a brief history of the program and what had gone wrong, see my post from last November: “The Windows Insider Program is a confusing mess.”)

Also: If Microsoft really wants to fix Windows 11, it should do these four things ASAP

If you’re currently participating in the Windows Insider Program, these are meaningful changes. Here’s what you can expect.

Simplifying the Insider channel lineup

Throughout the Windows 11 era, signing up for the Insider program has required choosing one of four channels using a dialog in Windows Settings. Here’s what those options look like today on one of my test PCs.

insider-program-channels-lineup-old

The current Insider channel lineup is confusing, to say the least.

Screenshot by Ed Bott/ZDNET

Which channel should you choose? As the company admitted in today’s post, “the channel structure became confusing. It was not clear what channel to pick based on what you wanted to get out of the program.”

The new lineup consists of two primary channels: Experimental and Beta. The Release Preview channel will still be available, primarily for the benefit of corporate customers who want early access to production builds a few days before their official release. That option will be available under the Advanced Options section.

windows-insider-channel-lineup-new

This simplified lineup is easier to follow. Beta is the upcoming retail release, Experimental is for the adventurous.

Screenshot courtesy of Microsoft

Here’s Microsoft’s official description of what’s in each channel now, with the company’s emphasis retained:

  • Experimental replaces what were previously the Dev and Canary channels. The name is deliberate: you’re getting early access to features under active development, with the understanding that what you see may change, get delayed, or not ship at all. We’ve heard your feedback that you want to access and contribute to features early in development and this is the channel to do that.
  • Beta is a refresh of the previous Beta Channel and previews what we plan to ship in the coming weeks. The big change: we’re ending gradual feature rollouts in Beta. When we announce a feature in a Beta update and you take that update, you will have that feature. You may occasionally see small differences within a feature as we test variations, but the feature itself will always be on your device.

These changes will apply to the Windows Insider Program for Business as well.

Offering a choice of platforms

For those testers who want to tinker with the bleeding edge of Windows development, a few additional options will be available in the Experimental channel. These advanced options will allow you to choose from a platform that’s aligned to a currently supported retail build. Currently, that’s Windows 11 version 25H2 or 26H1, with the latter being exclusively for new hardware arriving soon with Snapdragon X2 Arm chips.

Also: Microsoft account vs. local account: How to choose

There will also be a Future Platforms option, which represents a preview build that is not aligned to a retail version of Windows. According to today’s announcement, this option is “aimed at users who are looking to be at the forefront of platform development. Insiders looking for the earliest access to features should remain on a version aligned to a retail build.”

windows-insider-advanced-options-new

The Future Platforms option is the equivalent of the current Canary channel

Screenshot courtesy of Microsoft

Minimizing the chaos of Controlled Feature Rollout

Last month, I urged Microsoft to stop using its Controlled Feature Rollout technology, especially for builds in the Beta channel. Apparently, someone in Redmond was listening.

One of the most common questions we receive from Insiders is “why don’t I have access to a feature that’s been announced in a WIP blog?” This is usually due to a technology called Controlled Feature Rollout (CFR), a gradual process of rolling out new features to ensure quality before releasing to wider audiences. These gradual rollouts are an industry standard that help us measure impact before releasing more broadly. But they also make your experience unpredictable and often mean you don’t get the new features that motivated many of you to join the Insider program to begin with.

Moving forward, Insider builds in the Beta channel will no longer suffer from this gradual rollout of features. Meanwhile, the company says, “Insiders in the Experimental channel will have a new ability to enable or disable specific features via the new Feature Flags page on the Windows Insider Program settings page.”

windows-insider-feature-flags

Builds in the Experimental channel will include the option to turn new features on or off.

Screenshot courtesy of Microsoft

Not every feature will be available from this list, but the intent is to add those flags for “visible new features” that are announced as part of a new Insider build.

Making it easier to change channels

The final change announced today is one I didn’t see coming. Historically, leaving the Windows Insider Program or downgrading a channel (from Dev to Beta, for example) has required a full wipe and reinstall. That’s a major hurdle and a big impediment to anyone who doesn’t have the time or technical skills to do that sort of migration.

Also: Why Microsoft is forcing Windows 11 25H2 update on all eligible PCs

Beginning with the new channel lineup, it should be easier to change channels or leave the program without jumping through a bunch of hoops.

To make this a more streamlined and consistent experience, we’re making some behind the scenes changes to enable Insider builds to use an in-place upgrade (IPU) to hop between versions. This will allow in most cases Insiders to move between Experimental, Beta, and Release Preview on the same Windows core version, or leave the program without a clean install. An IPU takes a bit more time than your normal update but migrates your apps, settings, and data in-place.

If you’ve chosen one of the future platforms from the Experimental channel, those options don’t apply. To move back to a supported retail platform, you’ll need to do a clean install.

Also: Apple, Google, and Microsoft join Anthropic’s Project Glasswing to defend world’s most critical software

The upshot of all these changes should make things a lot clearer for anyone trying to figure out what’s coming in the next big feature update. Beta channel updates, for example, should offer a more accurate preview of what’s coming in the next big feature update, so over the next month or two we should get a better picture of what’s coming in the 26H2 release, due in October.

When can we start to see those changes rolling out to the general public? Stay tuned.





Source link