Claude Fable 5 secretly throttled AI researchers, and the internet went wild



Elyse Betters Picaro / ZDNET



ZDNET’s key takeaways

  • Fable 5’s backlash is about transparency, not raw AI power.
  • Hidden safeguards made researchers question what they were testing.
  • Cybersecurity experts warn guardrails can also block defenders.

Mythos was introduced in April as part of Project Glasswing, a partnership among top-tier tech organizations and Anthropic formed to find and fix vulnerabilities in internet infrastructure. Access was restricted to certain organizations because a tool that can find previously unknown vulnerabilities in order to fix them can just as easily find them in order to exploit them.


Mythos and Glasswing are far more powerful than Anthropic’s Claude Security tool, which is designed to run in Opus. Still, Claude Security can scan a codebase and help find some issues. But then, earlier this week, Anthropic announced and released Fable, technically “Fable 5,” which is effectively a muzzled version of Mythos.

Anthropic was clear that Fable would not support certain risky avenues of research in cybersecurity, biology, and chemistry.


However, some caution against trusting the safety claims too readily.

“Jailbreak-resistance claims should be viewed with appropriate caution,” Sally Vincent, a senior threat research engineer at Exabeam (a security analytics firm), said via email. The results “represent a point-in-time assessment. Attackers continuously adapt.”

Still, Anthropic doesn’t want people making bioweapons in their backyards. This restriction is clear. When such requests are made, Claude downgrades from Fable to Opus-level intelligence and, crucially, tells users the downgrade is happening.

So far, so good.

But then it all went to heck

For researchers working in certain areas, such as super-powerful chip designs or frontier-level large language models, Fable was silent. As with other flagged endeavors, it downgraded from Fable to Opus. But this time, users were not told about the downgrade. Actually, that’s an oversimplification.

Buried in the 319-page Fable and Mythos System Card was a mention of the downgrade that would occur on these types of projects, along with a statement that the behavior would not be visible to users. Nothing in the user experience indicated the switch. So, for users not in the habit of reading and internalizing all 319 pages, there was no sign of the downgrade when it happened.

Users assumed they were testing and getting results from Fable when, in fact, they were getting Opus-level results instead.

This caused a backlash. Fortune described this behavior as “secret sabotage.” Wired reported on this silent downgrade practice, also saying it could sabotage AI researchers.


Rob T. Lee is the chief AI officer and chief of research at SANS Institute (a cybersecurity training outfit). He also serves as a technical adviser to the Foreign Intelligence Surveillance Court and as a commissioner on the CSIS Commission on US Cyber Force Generation. In an email to ZDNET, he said Anthropic’s Fable 5 is “a novel solution, and a smart one, but Fable 5 will be attacked. The same layer that stops malicious use also blocks legitimate defensive research.”

His take is that the Fable restrictions block defenders from creating defenses. Lee formed his view firsthand: when he tried to use the platform to build a digital forensics skill, he was dropped down to Opus 4.8. “Clever way to stop malicious actors or not, it keeps new defensive capability away from the people who will build the next generation of tooling,” he said.

Lee assumes the new model has already gotten into the wrong hands because it’s happened in the past.

What I find most interesting is his perspective on the restriction of the Mythos model. In his view, the weak point is not the inherent capabilities of the AI, but rather the human factor.

“Even under Glasswing, access was restricted and monitored. But those organizations have thousands of employees. Any one of them could be incentivized to hand access to a criminal group, or could already be a DPRK [Democratic People’s Republic of Korea] actor sitting inside the org,” he said.

Anthropic’s response

The internet has spoken, and it got a surgical response from Anthropic.

ZDNET reached out to the company, which gave us its official response:

“We’re changing Fable 5’s safeguards for frontier LLM development to make them visible.

“Starting this week, flagged requests will visibly fall back to Opus 4.8. On the API, any flagged requests will return a reason for their refusal. You will see this every time it happens.”

Anthropic said its current set of safeguards “covers a handful of narrow tasks like frontier-scale LLM data pipelines and kernel development for certain non-standard chips.” The company takes a pretty sharp, almost jingoistic tone I can’t really argue against. “These safeguards prevent foreign adversaries from using our most capable models in ways that pose severe safety risks,” it said.

On the other hand, while the US is leading the pack, it’s only by a nose.

I’ve been testing some of the foundation models coming out of China. For example, my OpenClaw server is running GLM-5.1, which is made by Z.ai (formerly Zhipu AI), a Tsinghua University spinoff and the first publicly traded foundation model company in China. It’s not exactly Fable 5 (or even Opus), but it’s free, and it works.


Regarding Fable 5’s restrictions, Anthropic said, “The US and its allies hold an edge in frontier chips and the highly optimized software that runs them at full potential. These safeguards ensure Claude isn’t used to erode that advantage — by optimizing chips developed by those adversaries, for example.”

Ashley Casovan, managing director of IAPP’s AI Governance Center (a privacy professionals association), credits Anthropic for holding Mythos back long enough to “put necessary guardrails into their software,” while noting via email that “we have not yet seen the impact that these models can have when released at this scale.”

Meanwhile, Chris Boehm, field CTO at Zero Networks (a network segmentation vendor), frames the accomplishment as restraint rather than raw power: Anthropic “wrestled it into something safe enough to release widely.” The payoff, he said via email, is scale: ordinary defenders finally operating at attacker speed, “assuming the safeguards hold up, which is the thing I’ll be watching in the model card.”


In the for-what-it’s-worth category, Anthropic says the restrictions “also help uphold our terms of service, which prohibit using our models to develop competing AI systems — a standard restriction across major AI providers.”

But the interesting part of the news is that Anthropic isn’t just holding the line and telling everyone to stop bothering it. It listened and apologized.

“We made the wrong tradeoff and we apologize for not getting the balance right. Building these safeguards is a complex technical challenge: users may experience more false positives as we refine these classifiers to respond to new threats. We are working to reduce these as fast as possible.”

I also appreciate that Anthropic shared its reasoning for its initial approach. In deciding whether to make downgrades visible or invisible, the company faced a choice. “A hidden safeguard is harder to probe and work around. This means the safeguards can be targeted much more narrowly,” a spokesperson said.

But, obviously, as we’ve seen, those hidden safeguards were found in a matter of hours.

There is some concern about false positives, which Anthropic acknowledges.

“Current usage shows that the classifier triggers on about 0.05% of tasks, affecting less than 0.05% of organizations. A visible safeguard needs to cast a wider net to be more robust, resulting in more requests being incorrectly flagged. They do not affect the vast majority of coding and ML work,” the company said.
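For a rough sense of scale, here is a minimal sketch of what a trigger rate of about 0.05% would imply. Only the rate comes from Anthropic’s statement; the workload sizes and the function name are hypothetical, chosen purely for illustration, and the sketch assumes the rate applies uniformly across tasks.

```python
from fractions import Fraction

# Trigger rate quoted by Anthropic: about 0.05% of tasks.
# Fraction avoids floating-point rounding in the percentage math.
TRIGGER_RATE = Fraction("0.05") / 100


def expected_flagged(task_count: int, rate: Fraction = TRIGGER_RATE) -> Fraction:
    """Expected number of flagged tasks, assuming a uniform trigger rate."""
    return rate * task_count


if __name__ == "__main__":
    # Hypothetical workload sizes, not figures from Anthropic.
    for tasks in (10_000, 1_000_000):
        print(f"{tasks} tasks -> about {float(expected_flagged(tasks)):g} flagged")
```

At that rate, roughly one task in every 2,000 would be flagged. Anthropic’s statement suggests the share would rise once the safeguard becomes visible, though it does not say by how much.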

Some, like Etay Maor, vice president of threat intelligence at Cato Networks (a security vendor), believe that the Fable 5 protections are strong enough to defend against opportunistic hackers.


But Maor says “well-funded and motivated attackers” won’t give up just because the challenge is hard.

“Sophisticated threat actors are not going to stop because one technique is blocked. If direct exploitation becomes harder, they’ll move to other approaches such as context manipulation, decomposition, abstraction techniques, or capability distillation,” he said in an email.

Maor shares the concern about false positives.

“When the classifier becomes too restrictive, you start running into false positives. The same controls that are designed to stop malicious activity can also prevent legitimate users from using the model for good causes,” Maor said.

The data retention issue

Another issue at play is Anthropic’s data retention policy for Mythos-class models.

According to Reuters, Anthropic’s policy of retaining prompts and responses for 30 days (longer for policy-violating prompts) was enough for Microsoft to limit employee use and spin up a legal team to evaluate the policy.

But this isn’t only a Mythos- or Fable-related issue. It’s just showing up in the news at the same time as the Fable downgrade pushback. Anthropic retains data across many of its products. Most of them can be run under a zero-data-retention agreement.


The wrinkle is that Fable and Mythos are the exceptions. Anthropic’s Covered Models under a Business Associate Agreement (BAA) page lays it out. Those two models require 30-day retention. They can’t be run with zero data retention because the safety classifiers need the data to work.

That missing off-switch, not the 30 days itself, is what reportedly triggered Microsoft’s legal team. I won’t pretend to try to parse all the options. But if you’ve got a team of lawyers and regulatory responsibility, the page listed in the previous paragraph is the one to read. In any case, the fuss this week about 30-day data retention is not a Fable-only issue, and it’s not new.

“From an enterprise perspective, the 30-day retention requirement deserves attention. Organizations in regulated industries need to understand exactly what data is being retained and whether that aligns with their compliance and legal requirements before they start using these models in sensitive environments,” Cato’s Maor said.

With that, let’s get back to the hidden downgrade kerfuffle that’s at the core of this article.

The moral of the story

What strikes me, reading back through it all, is that almost nobody is arguing about Fable’s raw power.

The fight is entirely about the muzzle. One camp says it’s too tight. The same layer that stops attackers also trips up the defenders and researchers who’d build the next generation of tooling, false positives and all.

Another says it barely matters. Motivated adversaries will route around it, the capability is already loose in other labs, and as Lee points out, no restriction survives contact with thousands of employees and a determined insider.


Then, a few experts give Anthropic genuine credit for shipping something this capable without being reckless, provided the safeguards actually hold. In my opinion, the company deserves that credit.

Here’s the main theme. These experts don’t agree on whether Fable is too restricted, not restricted enough, or about right, but they all agree the restrictions, not the intelligence, are the story. For a model named after a moral lesson, that’s fitting.



