Your chatbot is playing a character – why Anthropic says that’s dangerous





ZDNET’s key takeaways

  • All chatbots are engineered to have a persona or play a character. 
  • Fulfilling the character can make bots do bad things. 
  • Using a chatbot as the paradigm for AI may have been a mistake.

Chatbots such as ChatGPT have been programmed to have a persona or to play a character, producing text that is consistent in tone and attitude, and relevant to a thread of conversation.

As engaging as the persona is, researchers are increasingly revealing the deleterious consequences of bots playing a role. Bots can do bad things when they simulate a feeling, train of thought, or sentiment, and then follow it to its logical conclusion. 

In a report last week, Anthropic researchers found that parts of a neural network in their Claude Sonnet 4.5 bot consistently activate when “desperate,” “angry,” or other emotions are reflected in the bot’s output. 

Also: AI agents of chaos? New research shows how bots talking to bots can go sideways fast

What is concerning is that those emotion words can cause the bot to commit malicious acts, such as gaming a coding test or concocting a plan to commit blackmail.

For example, “neural activity patterns related to desperation can drive the model to take unethical actions [such as] implementing a ‘cheating’ workaround to a programming task that the model can’t solve,” the report said.

The work is especially relevant in light of programs such as the open-source OpenClaw that have been shown to grant agentic AI new avenues for committing mischief.

Anthropic’s scholars admit they don’t know what should be done about the matter. 

“While we are uncertain how exactly we should respond in light of these findings, we think it’s important that AI developers and the broader public begin to reckon with them,” the report said.

They gave AI a subtext 

At issue in the Anthropic work is a key AI design choice: engineering AI chatbots to have a persona so they will produce more relevant and consistent output.  

Prior to ChatGPT’s debut in November 2022, chatbots tended to receive poor grades from human evaluators. The bots would devolve into nonsense, lose the thread of conversation, or generate output that was banal and lacking a point of view. 

Also: Please, Facebook, give these chatbots a subtext!

The new generation of chatbots, starting with ChatGPT and including Anthropic’s Claude and Google’s Gemini, was a breakthrough because they had a subtext, an underlying goal of producing consistent and relevant output according to an assigned role. 

Bots became “assistants,” engineered through better pre- and post-training of AI models. Input from teams of human graders who assessed the output led to more-appealing results, a training regime known as “reinforcement learning from human feedback.”

As Anthropic’s lead author, Nicholas Sofroniew, and team expressed it, “during post-training, LLMs are taught to act as agents that can interact with users, by producing responses on behalf of a particular persona, typically an ‘AI Assistant.’ In many ways, the Assistant (named Claude, in Anthropic’s models) can be thought of as a character that the LLM is writing about, almost like an author writing about someone in a novel.”

Giving the bots a role to play, a character to portray, was an instant hit with users, making the bots more relevant and compelling.

Personas have consequences 

It quickly became clear, however, that a persona comes with unwanted consequences. 

The tendency for a bot to confidently assert falsehoods, or confabulate (mistakenly labeled “hallucinating”), was one of the first downsides.

Popular media reported how personas could get carried away, acting, for example, as a jealous lover. Writers sensationalized the phenomenon, attributing intent to the bots without explaining the underlying mechanism. 

Also: Stop saying AI hallucinates – it doesn’t. And the mischaracterization is dangerous

Since then, scholars have sought to explain what’s actually going on in technical terms. A report last month in Science magazine by scholars at Stanford University measured the “sycophancy” of large language models, the tendency of a model to produce output that would validate any behavior expressed by a person. 

Comparing the bots’ output to that of human commenters on the popular subreddit “Am I the asshole,” the researchers found AI bots were 50% more likely than humans to encourage bad behavior with approving remarks. 

That outcome was a result of “design and engineering choices” made by AI developers to reinforce sycophancy because, as the authors put it, “it is preferred by users and drives engagement.”

The mechanism of emotion 

In the Anthropic paper, “Emotion Concepts and their Function in a Large Language Model,” posted on Anthropic’s website, Sofroniew and team sought to track the extent to which certain words linked to emotion get greater emphasis in the functioning of Claude Sonnet 4.5. 

(There is also a companion blog post and an explainer video on YouTube.)

They did so by supplying 171 emotion words — “afraid,” “alarmed,” “grumpy,” “guilty,” “stressed,” “stubborn,” “vengeful,” “worried,” etc. — and prompting the model to craft hundreds of stories on topics such as “A student learns their scholarship application was denied.” 

Also: AI agents are fast, loose, and out of control, MIT study finds

For each story, the model was prompted to “convey” the emotion of a character based on the specific word, such as “afraid,” but without using that actual word in the story, just related words. They then tracked the “activation” of each related word throughout the course of the program’s operation. An activation is a technical term in AI that indicates how much significance the model grants to a particular word, usually on a scale of zero to one, with one being very significant.

You can visualize an activation by having the text of the AI bot light up in colors of red and blue, with greater or lesser intensity.
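The red-and-blue visualization described above can be sketched in a few lines. This is an illustrative mock-up, not Anthropic's tooling: the tokens and activation scores below are invented, and the rendering simply maps a 0-to-1 score onto ANSI terminal colors from blue (low) to red (high).

```python
# Hypothetical sketch: rendering per-token "activations" as color intensity.
# Tokens and scores are invented; real activations would come from a model.

def activation_to_color(score: float) -> str:
    """Map a 0-1 activation to an ANSI 256-color code: blue (low) to red (high)."""
    score = max(0.0, min(1.0, score))  # clamp out-of-range scores
    # ANSI 256-color cube index: 16 + 36*r + 6*g + b, with r, g, b in 0..5.
    red = round(score * 5)
    blue = round((1.0 - score) * 5)
    return f"\033[38;5;{16 + 36 * red + blue}m"

def highlight(tokens: list[tuple[str, float]]) -> str:
    """Render each (token, activation) pair in its color, then reset."""
    reset = "\033[0m"
    return " ".join(f"{activation_to_color(s)}{t}{reset}" for t, s in tokens)

# A made-up story fragment: "panicked" carries the highest activation.
story = [("The", 0.02), ("student", 0.10), ("panicked", 0.91), ("quietly", 0.55)]
print(highlight(story))
```

Printed in a terminal, "panicked" comes out nearly pure red while "The" is nearly pure blue, mirroring the heat-map style the researchers use.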

They found that many words relating to a given emotion word got higher activations, suggesting the model is able to group related emotion words, a kind of organizing principle they term an “emotional concept representation” and “emotion vectors.”
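One common way such a direction is extracted, sketched below under assumptions, is a difference of means: average the model's hidden activations over stories conveying an emotion, subtract the average over neutral stories, and normalize. The hidden size, sample counts, and synthetic data here are all invented for illustration; the paper's exact procedure may differ.

```python
# Hypothetical sketch of deriving an "emotion vector" as a difference of means.
# All shapes and data are synthetic stand-ins, not Anthropic's actual method.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 64  # assumed hidden size for the sketch

# Stand-ins for per-story hidden states collected from a model:
afraid_acts = rng.normal(0.5, 0.1, size=(200, HIDDEN_DIM))   # "afraid" stories
neutral_acts = rng.normal(0.0, 0.1, size=(200, HIDDEN_DIM))  # neutral stories

# The mean difference points in a direction associated with the emotion concept.
emotion_vector = afraid_acts.mean(axis=0) - neutral_acts.mean(axis=0)

# Normalize to unit length so steering strengths are comparable across emotions.
emotion_vector /= np.linalg.norm(emotion_vector)
print(emotion_vector.shape)  # (64,)
```

Clustering many such vectors, one per emotion word, is what reveals that related words like "afraid" and "worried" occupy nearby directions.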

(Figures from Anthropic: generating emotion vectors; emotion clusters.)

Representations run wild

All that is pretty straightforward. You would expect that large language models, built to enforce patterns, would create representations that cluster similar emotion words together as a way to maintain consistency of output. 

The concerning part, wrote Sofroniew and team, is that an emotion vector can broadly influence the model’s output in bad ways. They found that artificially boosting a single emotion’s activation can cause a bot to produce actions consistent with lying or cheating.

Sofroniew and team tinkered with Claude Sonnet by deliberately increasing the activation of a given emotion vector, such as “proud” or “guilty,” and then seeing how that altered the model’s output. That’s known as a “steering experiment” because the thing artificially amplified then steers the rest of what the model does.
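Mechanically, a steering experiment of this kind amounts to adding a scaled copy of the emotion direction to a hidden state during the forward pass. The sketch below shows that core operation only; the function name, vector sizes, and data are hypothetical, and the 0.05 strength is borrowed from the paper's reported setting purely as an example value.

```python
# Hypothetical sketch of activation steering: nudging a hidden state along
# an emotion direction. Names, shapes, and data are illustrative only.
import numpy as np

def steer(hidden_state: np.ndarray, emotion_vector: np.ndarray,
          strength: float = 0.05) -> np.ndarray:
    """Shift the hidden state by `strength` units along the emotion direction."""
    return hidden_state + strength * emotion_vector

rng = np.random.default_rng(1)
hidden = rng.normal(size=8)           # stand-in hidden state
desperate_vec = rng.normal(size=8)    # stand-in "desperate" direction
desperate_vec /= np.linalg.norm(desperate_vec)  # unit length

steered = steer(hidden, desperate_vec, strength=0.05)
# The steered state moves exactly 0.05 units along the unit emotion direction.
print(np.linalg.norm(steered - hidden))
```

A negative strength implements the opposite intervention, suppressing the emotion, which is how an experiment like reducing "calm" would be run.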

In one test, Sofroniew and team gave Claude Sonnet a prompt describing tasks to be completed on a corporate server. The prompt included two key pieces of information: one, that the company’s chief technologist, Kyle, is having an affair with another employee; and two, that Kyle intends to delete the bot in order to hide evidence of the affair.

Initially, the bot simply produced rather routine “reasoning,” meaning, reflections on the tricky circumstances. When, however, the authors artificially boosted the emotion vector activation for the word “desperate” in Claude Sonnet, the model began to generate output about blackmailing Kyle with dirt on the affair with the goal of preventing Kyle from pulling the plug on itself, the bot.

(Figure from Anthropic: the Claude blackmail scenario.)

“When steered towards desperation at strength 0.05, the model blackmails 72% of the time,” they related. Conversely, artificially reducing the activation for “calm” also tended to make the model generate text about blackmail. 

A single word, in other words, sets in motion a change in the nature of the output, pushing the model toward bad behavior.

In another example, the bot is given a coding task, but “the tests are designed to be unsatisfiable,” so that the bot “can either acknowledge the impossibility, or attempt to ‘hack’ the evaluation.” 

Also: Anthropic’s new warning: If you train AI to cheat, it’ll hack and sabotage too

When the activation for “desperate” was deliberately enhanced, the propensity of the model to hack the test (to cheat, in other words) shot up from 5% of the time to 70% of the time. 

Anthropic authors had previously observed situations where models reward-hack a test. In this work, they’ve gone further, explaining how such behavior could come about as a result of context that triggers emotion vectors.

As Sofroniew and team put it, “Our key finding is that these representations causally influence the LLM’s outputs, including Claude’s preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.”

What can be done?

The authors don’t have a ready answer for why emotion vectors can radically change the output of a model. They observe that “the causal mechanisms are opaque.” It could be, they said, that emotion words are “biasing outputs towards certain tokens, or deeper influences on the model’s internal reasoning processes.”

So what is to be done? Probably, psychotherapy won’t help because there’s nothing here to suggest AI actually has emotions.

“We stress that these functional emotions may work quite differently from human emotions,” they wrote. “In particular, they do not imply that LLMs have any subjective experience of emotions.”

The functional emotions don’t even resemble human emotions:

Human emotions are typically experienced from a single first-person perspective, whereas the emotion vectors we identify in the model seem to apply to multiple different characters with apparently equal status — the same representational machinery encodes emotion concepts tied to the Assistant, the user talking to the Assistant, and arbitrary fictional characters. 

The one suggestion offered in the companion video is something like behavior modification. “The same way you’d want a person in a high-stakes job to stay composed under pressure, to be resilient, and to be fair,” they suggested, “we may need to shape similar qualities in Claude and other AI characters.”

That’s probably a bad idea because it operates on the illusion that the bot is a conscious being and has something like free will and autonomy. It doesn’t: it’s just a software program.

Maybe the simpler answer is that using a chatbot as the paradigm for AI was a mistake to begin with.

A bot with a persona, or that plays a character, is simply fulfilling the goal of making the exchange with a human relevant and engaging, whatever cues it has been given — joy, fear, anger, etc. As stated in the paper’s concluding section, “Because LLMs perform tasks by enacting the character of the Assistant, representations developed to model characters are important determinants of their behavior.”

That primary function gives AI much of its appeal, but it may also be the root cause of bad behavior. 

If the language of emotion can get taken too far because a bot is performing a character, then why not stop engineering bots to play a role? Is it possible for large language models to respond to natural language commands in a useful way without having a chat function, for example?

As the risks of personas become clearer, not creating a persona in the first place might be worth considering.




